1 Introduction



Detection of sparse mixtures is an important problem that arises in many scientific applications 
such as signal processing [11], biostatistics [23], and astrophysics [8, 24], where the goal is to 
determine the existence of a signal which only appears in a small fraction of the noisy data. For 
example, topological defects and Doppler effects manifest themselves as a non-Gaussian convolution
component in the Cosmic Microwave Background (CMB) temperature fluctuations. Detection of
non-Gaussian signatures is important for identifying the cosmological origins of many phenomena [24].
Another example is disease surveillance, where it is critical to discover an outbreak when the infected
population is small [25]. The detection problem is of significant interest also because it is closely
connected to a number of other important problems including estimation, screening, large-scale 
multiple testing, and classification. See, for example, [6], [7], [12], [17], and [23]. 

1.1 Detection of sparse binary vectors 

One of the earliest works on sparse mixture detection dates back to Dobrushin [11], who considered
the following problem originating from multi-channel detection in radiolocation. Let Ray($\alpha$)
denote the Rayleigh distribution with parameter $\alpha$, whose density is supported on $y > 0$. Let
$\{Y_i\}_{i=1}^n$ be independently distributed according to Ray($\alpha_i$), representing the random
voltages observed on the $n$ channels. In the absence of signal, the $\alpha_i$'s are all equal to one,
the nominal value; while in the presence of signal, exactly one of the $\alpha_i$'s becomes a known
value $a > 1$. Denoting the uniform distribution on $[n]$ by $U_n$, the goal is to test the following
competing hypotheses

$$H_0^{(n)}: \alpha_i = 1,\ i \in [n], \quad \text{versus} \quad H_1^{(n)}: \alpha_i = 1 + (a-1)\mathbf{1}_{\{i = J\}},\ J \sim U_n. \tag{1}$$

Since the signal only appears once out of the $n$ samples, in order for the signal to be distinguishable
from noise, it is necessary for the amplitude $a$ to grow with the sample size $n$ (in fact, at least
logarithmically). By proving that the log-likelihood ratio converges to a stable distribution in the
large-$n$ limit, Dobrushin [11] obtained sharp asymptotics of the smallest $a$ required to achieve the
desired false alarm and missed detection probabilities. Similar results are obtained in the
continuous-time Gaussian setting by Burnashev and Begmatov [5].

Subsequent important works include Ingster [20] and Donoho and Jin [12], which focused on
detecting a sparse binary vector in the presence of Gaussian observation noise. The problem can
be formulated as follows. Given a random sample $\{Y_1, \ldots, Y_n\}$, one wishes to test the hypotheses

$$H_0^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1),\ i \in [n] \quad \text{versus} \quad H_1^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n)\mathcal{N}(0,1) + \epsilon_n \mathcal{N}(\mu_n, 1),\ i \in [n] \tag{2}$$

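where the non-null proportion $\epsilon_n$ is calibrated according to

$$\epsilon_n = n^{-\beta}, \quad \tfrac{1}{2} < \beta < 1, \tag{3}$$

and the non-null effect $\mu_n$ grows with the sample size according to

$$\mu_n = \sqrt{2 r \log n}, \quad r > 0. \tag{4}$$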



Equivalently, one can write 

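$$Y_i = X_i + Z_i, \tag{5}$$

where $Z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ is the observation noise. Under the null hypothesis, the mean vector
$X^n = (X_1, \ldots, X_n)$ is equal to zero; under the alternative, $X^n$ is a non-zero sparse binary vector with
$X_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n)\delta_0 + \epsilon_n \delta_{\mu_n}$, where $\delta_a$ denotes the point mass at $a$.

The detection boundary, which gives the smallest possible signal strength $r$ such that reliable
detection is possible, is given by the following function of the sparsity parameter $\beta$:

$$r^*(\beta) = \begin{cases} \beta - \frac{1}{2}, & \frac{1}{2} < \beta \le \frac{3}{4}, \\ \left(1 - \sqrt{1-\beta}\right)^2, & \frac{3}{4} < \beta < 1. \end{cases} \tag{6}$$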

See Ingster [20] and Donoho and Jin [12]. Therefore, the hypotheses in (2) can be tested with
vanishing probability of error if and only if the pair $(\beta, r)$ lies in the strict epigraph

$$\{(\beta, r) : r > r^*(\beta)\}, \tag{7}$$

which is called the detectable region. Furthermore, because the fraction of the non-zero means is
very small, most tests based on the empirical moments have no power in detection. Donoho and
Jin [12] proposed an adaptive testing procedure based on Tukey's higher criticism statistic and
showed that it attains the optimal detection boundary (6) without requiring the knowledge of the
unknown parameters $(\beta, r)$.

The above results have been generalized along various directions within the framework of two- 
component Gaussian mixtures. Jager and Wellner [22] proposed a family of goodness-of-fit tests 
based on the Rényi divergences [29, p. 554], including the higher criticism test as a special case,
which achieve the optimal detection boundary adaptively. The detection boundary with correlated 
noise was established in [16] which also proposed a modified version of the higher criticism that 
achieves the corresponding optimal boundary. In a related setup, [4, 2, 3] considered detecting 
a signal with a known geometric shape in Gaussian noise. Minimax estimation of the non-null 
proportion $\epsilon_n$ was studied in Cai, Jin and Low [7].

The setup of [20] and [12] specifically focuses on the two-point Gaussian mixtures. Although 
[20] and [12] provide insightful results for sparse signal detection, the setting is highly restrictive 
and idealized. In particular, it has the limitation that the signal strength must be constant
under the alternative, i.e., the mean vector $X^n$ takes the constant value $\mu_n$ on its support. In many
applications, the signal itself varies among the non-null portion of the samples. A natural question
is the following: What is the detection boundary if $\mu_n$ varies under the alternative, say with a
distribution $P_n$? Motivated by these considerations, the following heteroscedastic Gaussian mixture
model was considered in Cai, Jeng and Jin [6]:

$$H_0^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1) \quad \text{versus} \quad H_1^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n)\mathcal{N}(0,1) + \epsilon_n \mathcal{N}(\mu_n, \sigma^2). \tag{8}$$

In this case, [6, Theorems 2.1 and 2.2] showed that reliable detection is possible if and only if
$r > r^*(\beta, \sigma^2)$, where $r^*(\beta, \sigma^2)$ is given by

$$r^*(\beta, \sigma^2) = \begin{cases} (2-\sigma^2)\left(\beta - \frac{1}{2}\right), & \frac{1}{2} < \beta \le 1 - \frac{\sigma^2}{4},\ \sigma^2 < 2, \\ \left(1 - \sigma\sqrt{1-\beta}\right)_+^2, & \text{otherwise}, \end{cases} \tag{9}$$

where $x_+ = \max(x, 0)$. It was also shown that the optimal detection boundary can be achieved by
a double-sided version of the higher criticism test.
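
As a small numerical companion, the following sketch evaluates the heteroscedastic boundary (9), namely $(2-\sigma^2)(\beta-\frac12)$ when $\frac12 < \beta \le 1-\frac{\sigma^2}{4}$ and $\sigma^2 < 2$, and $(1-\sigma\sqrt{1-\beta})_+^2$ otherwise. The helper name `r_star_hetero` is hypothetical and introduced here only for illustration; it is not taken from [6].

```python
import math


def r_star_hetero(beta: float, sigma2: float) -> float:
    """Evaluate the boundary (9) for sparsity beta in (1/2, 1) and variance sigma2 > 0.

    Hypothetical helper for illustration; it only transcribes the formula.
    """
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if sigma2 <= 0.0:
        raise ValueError("sigma2 must be positive")
    if sigma2 < 2.0 and beta <= 1.0 - sigma2 / 4.0:
        # First branch of (9): (2 - sigma^2)(beta - 1/2).
        return (2.0 - sigma2) * (beta - 0.5)
    # Second branch of (9): (1 - sigma * sqrt(1 - beta))_+^2, with x_+ = max(x, 0).
    return max(1.0 - math.sqrt(sigma2 * (1.0 - beta)), 0.0) ** 2


if __name__ == "__main__":
    for beta in (0.6, 0.75, 0.9):
        print(beta, r_star_hetero(beta, 1.0), r_star_hetero(beta, 4.0))
```

With $\sigma^2 = 1$ the two branches reduce to $\beta - \frac12$ and $(1-\sqrt{1-\beta})^2$, i.e., the homoscedastic boundary (6); for large $\sigma^2$ the boundary can be zero, meaning that the variance inflation alone makes the mixture detectable.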

1.2 Detection of general sparse mixture 

Although the setup in Cai, Jeng and Jin [6] is more general than that considered in [20] and 
[12], it is still restricted to the two-component Gaussian mixtures. In many applications such as 
the aforementioned multi-channel detection [11] and astrophysical problems [24], the sparse signal 
may not be binary and the distribution may not be Gaussian. In the present paper, we consider 
the problem of sparse mixture detection in a general framework where the distributions are not 
necessarily Gaussian and the non-null effects are not necessarily a binary vector. More specifically, 
given a random sample $Y^n = \{Y_1, \ldots, Y_n\}$, we wish to test the following hypotheses

$$H_0^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} Q_n \quad \text{versus} \quad H_1^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n) Q_n + \epsilon_n G_n \tag{10}$$

where $Q_n$ is the null distribution and $G_n$ is a distribution modeling the statistical variations of the
non-null effects. The non-null proportion $\epsilon_n \in (0, 1)$ is calibrated according to (3).

In this paper we obtain an explicit formula for the fundamental limit of the general testing prob- 
lem (10) under mild technical conditions on the mixture. We also establish the adaptive optimality 
of the higher criticism procedure across all sparse mixtures satisfying certain mild regularity condi- 
tions. In particular, the general results obtained in this paper recover and extend all the previously 
known results mentioned earlier in a unified manner. The results also generalize the optimality and 
adaptivity of the higher criticism procedure far beyond the original equal-signal-strength Gaussian 
setup in [20, 12] and the heteroscedastic extension in [6]. In the most general case, it turns out 
that the detectability of the sparse mixture is governed by the behavior of the log-likelihood ratio 
evaluated at an appropriate quantile of the null distribution. 

Although our general approach does not rely on the Gaussianity of the model, it is nevertheless
instructive to begin by considering the special case of sparse normal mixtures with $Q_n = \mathcal{N}(0, 1)$,
i.e.,

$$H_0^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1) \quad \text{versus} \quad H_1^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n)\mathcal{N}(0,1) + \epsilon_n G_n. \tag{11}$$

It is of special interest to consider the convolution model, where

$$G_n = P_n * \mathcal{N}(0,1) \tag{12}$$

is a standard normal mixture and $*$ denotes the convolution of two distributions. In this case the
hypotheses (11) can be equivalently expressed via the additive-noise model (5), where $X_i = 0$ under
the null and $X_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon_n)\delta_0 + \epsilon_n P_n$ under the alternative. Based on the noisy observation $Y^n$,
the goal is to determine whether $X^n$ is the zero vector or a sparse vector, whose support size is
approximately $n\epsilon_n$ and whose non-zero entries are distributed according to $P_n$. Therefore, the distribution
$P_n$ represents the prior knowledge of the signal. The case of $P_n$ being a point mass is treated in
[20, 12]. The case of Rademacher $P_n$ is covered in [21, Chapter 8]. The heteroscedastic case where
$P_n$ is Gaussian is considered in [6]. These results can be recovered by particularizing the general
conclusion in the present paper.

Moreover, our results also shed light on what governs detectability in Gaussian noise when the 
signal does not necessarily have equal strength. For example, consider the classical setup (2) where
the signal strength $\mu_n$ is now a random variable. If we have $\mu_n = \sqrt{2r\log n}\,X$ for some random
variable $X$, then the resulting detectable region is given by the Ingster-Donoho-Jin expression (20)
scaled by the $L_\infty$-norm of $X$. On the other hand, it is also possible that certain distributions of
$\mu_n$ induce detectable regions shaped differently from that in Fig. 2. See Sections 3.1 and 5.2 for further
discussions. 

1.3 Organization 

The rest of the paper is organized as follows. Section 2 states the setup, defines the fundamental 
limit of sparse mixture detection and reviews some previously known results. The main results of 
the paper are presented in Sections 3 and 4, where we provide an explicit characterization of the 
optimal detection boundary under mild technical conditions. Moreover, it is shown in Section 4 
that the higher criticism test achieves the optimal performance adaptively. Section 5 particularizes 
the general result to various special cases to give explicit formulae of the fundamental limits. 
Discussions of generalizations and open problems are presented in Section 6. The main theorems 
are proven in Section 7, while the proofs of the technical lemmas are relegated to the appendices. 

1.4 Notations 

Throughout the paper, $\Phi$ and $\varphi$ denote the cumulative distribution function (CDF) and the
density of the standard normal distribution respectively. Let $\bar{\Phi} = 1 - \Phi$. Let $P^n$ denote the $n$-fold
product measure of $P$. We say $P$ is absolutely continuous with respect to $Q$, denoted by $P \ll Q$,
if $P(A) = 0$ for any measurable set $A$ such that $Q(A) = 0$. We say $P$ is singular with respect to $Q$,
denoted by $P \perp Q$, if there exists a measurable $A$ such that $P(A) = 1$ and $Q(A) = 0$. We denote
$a_n = o(b_n)$ if $\limsup_{n\to\infty} \frac{|a_n|}{|b_n|} = 0$, $a_n = \omega(b_n)$ if $b_n = o(a_n)$, $a_n = O(b_n)$ if $\limsup_{n\to\infty} \frac{|a_n|}{|b_n|} < \infty$,
and $a_n = \Omega(b_n)$ if $b_n = O(a_n)$. These asymptotic notations extend naturally to probabilistic setups,
denoted by $o_P$, $\omega_P$, etc., where limits are in the sense of convergence in probability.

2 Fundamental limits and characterization 

In this section we define the fundamental limits for testing the hypotheses (10) in terms of
the sparsity parameter $\beta$. An equivalent characterization in terms of the Hellinger distance is also
given.



2.1 Fundamental limits of detection 

It is easy to see that as the non-null proportion $\epsilon_n$ decreases, the signal becomes sparser and
the testing problem in (10) becomes more difficult. Recall that $\epsilon_n$ is given by (3), where $\beta \ge 0$
parametrizes the sparsity level. Thus, the question of detectability boils down to characterizing the
largest (resp. smallest) $\beta$ such that the hypotheses in (10) can be distinguished with probability
tending to one (resp. zero), when the sample size $n$ is large.

For testing between two probability measures $P$ and $Q$, denote the optimal sum of Type-I and
Type-II error probabilities by

$$\mathcal{E}(P, Q) \triangleq \inf_{A} \{P(A) + Q(A^c)\}, \tag{13}$$

where the infimum is over all measurable sets $A$. By the Neyman-Pearson Lemma [27], $\mathcal{E}(P, Q)$ is
achieved by the likelihood ratio test: declare $P$ if and only if $\frac{dP}{dQ} > 1$. Moreover, $\mathcal{E}(P, Q)$ can be
expressed in terms of the total variation distance

$$\mathrm{TV}(P, Q) \triangleq \sup_{A} |P(A) - Q(A)| = \frac{1}{2} \int |dP - dQ| \tag{14}$$

as

$$\mathcal{E}(P, Q) = 1 - \mathrm{TV}(P, Q). \tag{15}$$
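
For distributions on a finite alphabet, the identity (15) can be checked directly. The sketch below is only an illustration under that finite-alphabet assumption: `optimal_error_sum` brute-forces the infimum in (13) over all subsets $A$, and `total_variation` uses the $\frac12\int|dP-dQ|$ form of (14). Both function names are hypothetical.

```python
from itertools import combinations


def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_i |p_i - q_i| for pmfs on a common finite alphabet."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def optimal_error_sum(p, q):
    """Brute-force the infimum in (13): min over subsets A of P(A) + Q(A^c)."""
    idx = range(len(p))
    best = float("inf")
    for size in range(len(p) + 1):
        for subset in combinations(idx, size):
            a = set(subset)
            p_a = sum(p[i] for i in a)
            q_ac = sum(q[i] for i in idx if i not in a)
            best = min(best, p_a + q_ac)
    return best


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.3, 0.5]
    # The two printed numbers should agree, illustrating (15).
    print(optimal_error_sum(p, q), 1.0 - total_variation(p, q))
```

The brute-force search is exponential in the alphabet size and is meant only to make the definition concrete; the minimizing set is $\{i : p_i < q_i\}$, in line with the likelihood ratio test.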
For a fixed sequence $\{(Q_n, G_n)\}$, denote the total variation between the null and alternative by

$$V_n(\beta) \triangleq \mathrm{TV}\big(Q_n^n, ((1 - n^{-\beta}) Q_n + n^{-\beta} G_n)^n\big), \tag{16}$$

which takes values in the unit interval. In view of (15), the fundamental limits of testing the
hypotheses (10) are defined as follows.

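Definition 1.

$$\underline{\beta}^* \triangleq \sup\{\beta \ge 0 : V_n(\beta) \to 1\}, \tag{17}$$
$$\overline{\beta}^* \triangleq \inf\{\beta \ge 0 : V_n(\beta) \to 0\}. \tag{18}$$

If $\underline{\beta}^* = \overline{\beta}^*$, the common value is denoted by $\beta^*$.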

As illustrated by Fig. 1, the operational meaning of $\underline{\beta}^*$ and $\overline{\beta}^*$ is as follows: for any $\beta > \overline{\beta}^*$,
all sequences of tests have vanishing probability of success; for any $\beta < \underline{\beta}^*$, there exists a sequence
of tests with vanishing probability of error. In information-theoretic parlance, if $\underline{\beta}^* = \overline{\beta}^* = \beta^*$, we
say strong converse holds, in the sense that if $\beta > \beta^*$, all tests fail with probability tending to one;
if $\beta < \beta^*$, there exists a sequence of tests with vanishing error probability.

Clearly, $\underline{\beta}^*$ and $\overline{\beta}^*$ only depend on the sequence $\{(Q_n, G_n)\}$. The following lemma, proved in
Appendix A, shows that it is always sufficient to restrict the range of $\beta$ to the unit interval.

Figure 1: Critical values of $\beta$ and regimes of (in)distinguishability of the hypotheses (11) in the
large-$n$ limit.

Lemma 1.

$$0 \le \underline{\beta}^* \le \overline{\beta}^* \le 1. \tag{19}$$



In the Gaussian mixture model with $Q_n = \mathcal{N}(0, 1)$, if the sequence $\{G_n\}$ is parametrized by
some parameter $r$, the fundamental limit $\beta^*$ in Definition 1 is a function of $r$, denoted by $\beta^*(r)$.
For example, in the Ingster-Donoho-Jin setup (2) where $G_n = \mathcal{N}(\mu_n, 1)$, $\beta^*$, denoted by $\beta^*_{\rm IDJ}$, can
be obtained by inverting (6):

$$\beta^*_{\rm IDJ}(r) = \begin{cases} \frac{1}{2} + r, & 0 < r \le \frac{1}{4}, \\ 1 - \left(1 - \sqrt{r}\right)_+^2, & r > \frac{1}{4}. \end{cases} \tag{20}$$

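In terms of (20), the detectable region (7) is given by the strict hypograph $\{(r, \beta) : \beta < \beta^*_{\rm IDJ}(r)\}$.
The function $\beta^*_{\rm IDJ}$, plotted in Fig. 2, plays an important role in our later derivations. Similarly, for
the heteroscedastic mixture (8), inverting (9) gives

$$\beta^*(r, \sigma^2) = \begin{cases} \frac{1}{2} + \frac{r}{2 - \sigma^2}, & 2\sqrt{r} + \sigma^2 \le 2, \\ 1 - \frac{(1 - \sqrt{r})_+^2}{\sigma^2}, & 2\sqrt{r} + \sigma^2 > 2. \end{cases} \tag{21}$$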

As shown in Section 5, all the above results can be obtained in a unified manner as a consequence 
of the general results in Section 3. 
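
The inversion relating (6) and (20) is easy to sanity-check numerically. The following sketch, with hypothetical helper names `r_star` and `beta_idj`, transcribes $r^*(\beta)$ (equal to $\beta - \frac12$ for $\frac12 < \beta \le \frac34$ and $(1-\sqrt{1-\beta})^2$ for $\frac34 < \beta < 1$) and $\beta^*_{\rm IDJ}(r)$ (equal to $\frac12 + r$ for $0 < r \le \frac14$ and $1-(1-\sqrt r)_+^2$ for $r > \frac14$), and confirms that one undoes the other on a few sample points.

```python
import math


def r_star(beta: float) -> float:
    """Boundary (6): smallest detectable signal strength r for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


def beta_idj(r: float) -> float:
    """Boundary (20): largest detectable sparsity exponent beta for r > 0."""
    if r <= 0.0:
        raise ValueError("r must be positive")
    if r <= 0.25:
        return 0.5 + r
    return 1.0 - max(1.0 - math.sqrt(r), 0.0) ** 2


if __name__ == "__main__":
    for beta in (0.6, 0.75, 0.8, 0.95):
        # Round trip: beta -> r*(beta) -> beta_IDJ(r*(beta)) should return beta.
        print(beta, beta_idj(r_star(beta)))
```

Note that $\beta^*_{\rm IDJ}(r) = 1$ for every $r \ge 1$, so the map is invertible only on $0 < r < 1$, which is the range of $r^*$.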

2.2 Equivalent characterization via the Hellinger distance 

Closely related to the total variation distance is the Hellinger distance [26, Chapter 2] 

$$H^2(P, Q) = \int \left(\sqrt{dP} - \sqrt{dQ}\right)^2,$$

which takes values in the interval $[0, 2]$ and satisfies the following relationship:

$$\frac{1}{2} H^2(P, Q) \le \mathrm{TV}(P, Q) \le H(P, Q) \sqrt{1 - \frac{H^2(P, Q)}{4}} \le 1. \tag{22}$$

Therefore, the total variation distance converging to zero (resp. one) is equivalent to the squared
Hellinger distance converging to zero (resp. two). We will focus on the Hellinger distance
partly because it tensorizes nicely under product measures:

$$H^2(P^n, Q^n) = 2 - 2\left(1 - \frac{H^2(P, Q)}{2}\right)^n. \tag{23}$$
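
Both the sandwich bound (22) and the tensorization identity (23) can be verified by brute force on a small finite alphabet. The sketch below is illustrative only: the helper names are hypothetical, and the $n$-fold product measure is enumerated explicitly, which is feasible only for tiny $n$.

```python
import itertools
import math


def hellinger_sq(p, q):
    """Squared Hellinger distance sum_i (sqrt(p_i) - sqrt(q_i))^2, which lies in [0, 2]."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def product_pmf(p, n):
    """pmf of the n-fold product measure, enumerated over all length-n tuples."""
    return [math.prod(combo) for combo in itertools.product(p, repeat=n)]


if __name__ == "__main__":
    p, q, n = [0.5, 0.5], [0.9, 0.1], 4
    h2 = hellinger_sq(p, q)
    lhs = hellinger_sq(product_pmf(p, n), product_pmf(q, n))
    rhs = 2.0 - 2.0 * (1.0 - h2 / 2.0) ** n  # right-hand side of (23)
    print(lhs, rhs)
    tv = total_variation(p, q)
    # Both inequalities of (22) should hold.
    print(0.5 * h2 <= tv <= math.sqrt(h2) * math.sqrt(1.0 - h2 / 4.0))
```

The identity holds because $1 - \frac{H^2}{2}$ is the Hellinger affinity $\int\sqrt{dP\,dQ}$, which factorizes over product measures.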







Figure 2: Ingster-Donoho-Jin detection boundary (20) and the detectable region (below the curve).



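Denote the squared Hellinger distance between the null and the alternative by

$$H_n^2(\beta) \triangleq H^2\big(Q_n, (1 - n^{-\beta}) Q_n + n^{-\beta} G_n\big). \tag{24}$$

In view of (17) - (18) and (23), the fundamental limits $\underline{\beta}^*$ and $\overline{\beta}^*$ can be equivalently defined as
follows in terms of the asymptotic squared Hellinger distance:

$$\underline{\beta}^* = \sup\{\beta \ge 0 : H_n^2(\beta) = \omega(n^{-1})\}, \tag{25}$$
$$\overline{\beta}^* = \inf\{\beta \ge 0 : H_n^2(\beta) = o(n^{-1})\}. \tag{26}$$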



3 Main results 

In this section we characterize the detectable region explicitly by analyzing the exact
asymptotics of the Hellinger distance induced by the sequence of distributions $\{(Q_n, G_n)\}$.

3.1 Characterization of $\beta^*$ for Gaussian mixtures

In this subsection we focus on the case of sparse normal mixtures with $Q_n = \mathcal{N}(0, 1)$ and $G_n$
absolutely continuous. We will argue in Section 3.3 that by performing the Lebesgue decomposition
on $G_n$ if necessary, we can reduce the general problem to the absolutely continuous case.

We first note that the essential supremum of a measurable function $f$ with respect to a measure
$\mu$ is defined as

$$\operatorname*{ess\,sup}_{x} f(x) \triangleq \inf\{a \in \mathbb{R} : \mu(\{x : f(x) > a\}) = 0\}.$$

We omit mentioning $\mu$ if $\mu$ is the Lebesgue measure. Now we are ready to state the main result of
this section.

Theorem 1. Let $Q_n = \mathcal{N}(0, 1)$. Assume that $G_n$ has a density $g_n$ with respect to the Lebesgue
measure. Denote the log-likelihood ratio by

$$\ell_n \triangleq \log \frac{g_n}{\varphi}. \tag{27}$$

Let $\alpha : \mathbb{R} \to \mathbb{R}$ be a measurable function and define

$$\beta^\sharp \triangleq \frac{1}{2} + 0 \vee \operatorname*{ess\,sup}_{u \in \mathbb{R}} \left\{ \alpha(u) - u^2 + \frac{u^2 \wedge 1}{2} \right\}. \tag{28}$$

1. If

$$\liminf_{n \to \infty} \frac{\ell_n(u\sqrt{2 \log n})}{\log n} \ge \alpha(u) \tag{29}$$

uniformly in $u \in \mathbb{R}$, where $\alpha > 0$ on a set of positive Lebesgue measure, then $\underline{\beta}^* \ge \beta^\sharp$.

2. If

$$\limsup_{n \to \infty} \frac{\ell_n(u\sqrt{2 \log n})}{\log n} \le \alpha(u) \tag{30}$$

uniformly in $u \in \mathbb{R}$, then $\overline{\beta}^* \le \beta^\sharp$.

Consequently, if the limits in (29) and (30) agree and $\alpha > 0$ on a set of positive measure, then
$\beta^* = \beta^\sharp$.

Proof. Section 7.2. □ 
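
To make the formula (28) concrete, the following sketch approximates $\beta^\sharp = \frac12 + 0 \vee \operatorname{ess\,sup}_u\{\alpha(u) - u^2 + \frac{u^2\wedge 1}{2}\}$ by maximizing over a finite grid. This is only a numerical illustration under stated assumptions: $\alpha$ is taken to be continuous (so that a grid maximum approximates the essential supremum), and the grid range `u_max` and resolution `steps` are arbitrary tuning choices. The example $\alpha(u) = 2u\sqrt r - r$ is the one derived in Section 5.1 for the equal-signal-strength setup; the function names are hypothetical.

```python
import math


def beta_sharp(alpha, u_max: float = 3.0, steps: int = 60001) -> float:
    """Grid approximation of (28) over u in [-u_max, u_max].

    Assumes alpha is continuous and that the supremum is attained inside the grid range.
    """
    best = -math.inf
    for k in range(steps):
        u = -u_max + 2.0 * u_max * k / (steps - 1)
        best = max(best, alpha(u) - u * u + min(u * u, 1.0) / 2.0)
    return 0.5 + max(0.0, best)


def alpha_point_mass(r: float):
    """alpha(u) = 2 u sqrt(r) - r, the limit in (31) when G_n = N(sqrt(2 r log n), 1)."""
    return lambda u: 2.0 * u * math.sqrt(r) - r


if __name__ == "__main__":
    for r in (0.16, 0.64, 1.44):
        print(r, beta_sharp(alpha_point_mass(r)))
```

For this choice of $\alpha$ the grid values can be compared against the closed-form boundary (20), which gives $\frac12 + r$ for $r \le \frac14$ and $1-(1-\sqrt r)_+^2$ for $r > \frac14$.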

Assuming the setup of Theorem 1, we ask the following question in the reverse direction: What
kind of function $\alpha$ can arise in equations (29) and (30)? The following lemma (proved in Section 7.2)
gives a necessary and sufficient condition for $\alpha$. However, in the special case of convolutional models,
the function $\alpha$ needs to satisfy more stringent conditions, which we also discuss below.



Lemma 2. Suppose

$$\lim_{n \to \infty} \frac{\ell_n(u\sqrt{2 \log n})}{\log n} = \alpha(u) \tag{31}$$

holds uniformly in $u \in \mathbb{R}$ for some measurable function $\alpha : \mathbb{R} \to \mathbb{R}$. Then

$$\lim_{t \to \infty} \frac{1}{t} \log \int_{\mathbb{R}} \exp\big(t(\alpha(u) - u^2)\big)\, du = 0. \tag{32}$$

In particular, $\alpha(u) \le u^2$ Lebesgue-a.e. Conversely, for all measurable $\alpha$ that satisfy (32), there
exists a sequence $\{G_n\}$ such that (31) holds.

Additionally, if the model is convolutional, i.e., $G_n = P_n * \mathcal{N}(0, 1)$, then $\alpha$ is convex.

In many applications, we want to know how fast the optimal error probability decays if $\beta$ lies in
the detectable region. The following result gives the precise asymptotics for the Hellinger distance,
which also gives upper bounds on the total variation, in view of (22).

Theorem 2. Assume that (31) holds. For any $\beta > \frac{1}{2}$, the exponent of the Hellinger distance (24)
is given by

$$\lim_{n \to \infty} \frac{\log H_n^2(\beta)}{\log n} = E(\beta), \tag{33}$$

where

$$E(\beta) = \operatorname*{ess\,sup}_{u \in \mathbb{R}} \big\{ (2(\alpha(u) - \beta)) \wedge (\alpha(u) - \beta) - u^2 \big\} \tag{34}$$
$$= \operatorname*{ess\,sup}_{u : \alpha(u) \le \beta} \{2\alpha(u) - 2\beta - u^2\} \vee \operatorname*{ess\,sup}_{u : \alpha(u) > \beta} \{\alpha(u) - \beta - u^2\}, \tag{35}$$

which satisfies $E(\beta) > -1$ (resp. $E(\beta) < -1$) if and only if $\beta < \beta^*$ (resp. $\beta > \beta^*$).
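
The threshold behaviour of the exponent in (34) can be illustrated numerically. The sketch below evaluates $E(\beta) = \operatorname{ess\,sup}_u\{(2(\alpha(u)-\beta)) \wedge (\alpha(u)-\beta) - u^2\}$ on a grid, again assuming a continuous $\alpha$ and an arbitrary grid range. As a test case it uses $\alpha(u) = 2u\sqrt r - r$ (the equal-signal-strength case of Section 5.1) with $r = 0.16$, for which (20) gives $\beta^* = 0.66$; the function name `hellinger_exponent` is hypothetical.

```python
import math


def hellinger_exponent(alpha, beta: float, u_max: float = 3.0, steps: int = 60001) -> float:
    """Grid approximation of E(beta) in (34) over u in [-u_max, u_max]."""
    best = -math.inf
    for k in range(steps):
        u = -u_max + 2.0 * u_max * k / (steps - 1)
        gap = alpha(u) - beta
        # (2 * gap) ∧ gap picks 2 * gap when gap <= 0 and gap when gap > 0.
        best = max(best, min(2.0 * gap, gap) - u * u)
    return best


if __name__ == "__main__":
    r = 0.16
    alpha = lambda u: 2.0 * u * math.sqrt(r) - r
    for beta in (0.60, 0.66, 0.70):
        # E(beta) should cross -1 at beta = 0.66, the value given by (20) for r = 0.16.
        print(beta, hellinger_exponent(alpha, beta))
```

An exponent above $-1$ means $H_n^2(\beta) = \omega(n^{-1})$, which by (25) places $\beta$ in the detectable regime; an exponent below $-1$ corresponds to (26).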

As an application of Theorem 1, the following result relates the fundamental limit $\beta^*$ of the
convolutional models to the classical Ingster-Donoho-Jin detection boundary:

Corollary 1. Let $G_n = P_n * \mathcal{N}(0, 1)$. Assume that $P_n$ has a density $p_n$ which satisfies

$$\lim_{n \to \infty} \frac{\log p_n(t\sqrt{2 \log n})}{\log n} = -f(t) \tag{36}$$

uniformly in $t \in \mathbb{R}$ for some measurable $f : \mathbb{R} \to \mathbb{R}$. Then

$$\beta^* = \operatorname*{ess\,sup}_{t \in \mathbb{R}} \{\beta^*_{\rm IDJ}(t^2) - f(t)\}, \tag{37}$$

where $\beta^*_{\rm IDJ}$ is the Ingster-Donoho-Jin detection boundary defined in (20).

It should be noted that the convolutional case of the normal mixture detection problem is briefly 
discussed in [6, Section 6.1], where inner and outer bounds on the detection boundary are given but 
do not meet. Here Corollary 1 completely settles this question. See Section 5 for more examples. 

We conclude this subsection with a few remarks on Theorem 1. 

Remark 1 (Extremal cases). Under the assumption that the function $\alpha > 0$ on a set of positive
Lebesgue measure, the formula (28) shows that the fundamental limit $\beta^*$ lies in the very sparse
regime ($\frac{1}{2} \le \beta^* \le 1$). We discuss the two extremal cases as follows:

1. Weak signal: Note that $\beta^* = \frac{1}{2}$ if and only if $\alpha(u) \le u^2 - \frac{u^2 \wedge 1}{2}$ almost everywhere. In this case
the non-null effect is too weak to be detected for any $\beta > \frac{1}{2}$. One example is the zero-mean
heteroscedastic case $G_n = \mathcal{N}(0, \sigma^2)$ with $\sigma^2 \le 2$. Then we have $\alpha(u) \le \frac{u^2}{2}$.

2. Strong signal: Note that $\beta^* = 1$ if and only if there exists $u$ such that $|u| \ge 1$ and

$$\alpha(u) = u^2. \tag{38}$$

At this particular $u$, the density of the signal satisfies $g_n(u\sqrt{2 \log n}) = n^{-o(1)}$, which implies
that there exists significant mass beyond $\sqrt{2 \log n}$, the extremal value under the null
hypothesis [10]. This suggests the possibility of constructing test procedures based on the sample
maximum. Indeed, to understand the implication of (38) more quantitatively, let us look at
an even weaker condition: there exists $u$ such that $|u| \ge 1$ and

$$\limsup_{n \to \infty} \frac{1}{\log n} \log \frac{1}{\mathbb{P}\{u^{-1} Y_n \ge \sqrt{2 \log n}\}} = 0, \tag{39}$$

which, as shown in Appendix B, implies that $\beta^* = 1$.

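Remark 2. In general $\beta^*$ need not exist. Based on Theorem 1, it is easy to construct a Gaussian
mixture where $\underline{\beta}^*$ and $\overline{\beta}^*$ do not coincide. For example, let $\alpha_0$ and $\alpha_1$ be two measurable functions
which satisfy Lemma 2 and give rise to different values of $\beta^\sharp$ in (28), which we denote by $\beta_0^\sharp < \beta_1^\sharp$.
Then there exist sequences of distributions $\{G_n^{(0)}\}$ and $\{G_n^{(1)}\}$ which satisfy (31) for $\alpha_0$ and $\alpha_1$
respectively. Now define $\{G_n\}$ by $G_{2k} = G_{2k}^{(0)}$ and $G_{2k+1} = G_{2k+1}^{(1)}$. Then by Theorem 1, applied along
the even and odd subsequences, we have $\underline{\beta}^* = \beta_0^\sharp < \beta_1^\sharp = \overline{\beta}^*$.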

3.2 Non-Gaussian mixtures 

The detection boundary in [20, 12] is obtained by deriving the limiting distribution of the log- 
likelihood ratio which relies on the normality of the null hypothesis. In contrast, our approach is 
based on analyzing the sharp asymptotics of the Hellinger distance. This method enables us to 
generalize the result of Theorem 1 to sparse non-Gaussian mixtures (10), where we even allow the 
null distribution Q n to vary with the sample size n. 

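Theorem 3. Consider the hypothesis testing problem (10). Let $G_n \ll Q_n$. Denote by $F_n$ and $z_n$
the CDF and the quantile function of the null distribution $Q_n$, respectively, i.e.,

$$z_n(p) = \inf\{y \in \mathbb{R} : F_n(y) \ge p\}. \tag{40}$$

Suppose that the log-likelihood ratio

$$\ell_n \triangleq \log \frac{dG_n}{dQ_n} \tag{41}$$

satisfies

$$\lim_{n \to \infty} \sup_{s \ge (\log_2 n)^{-1}} \left| \frac{\ell_n(z_n(n^{-s})) \vee \ell_n(z_n(1 - n^{-s}))}{\log n} - \gamma(s) \right| = 0 \tag{42}$$

for some measurable function $\gamma : \mathbb{R}_+ \to \mathbb{R}$. If $\gamma > 0$ on a set of
positive Lebesgue measure, then

$$\beta^* = \frac{1}{2} + 0 \vee \operatorname*{ess\,sup}_{s \ge 0} \left\{ \gamma(s) - s + \frac{s \wedge 1}{2} \right\}. \tag{43}$$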



The function $\gamma$ appearing in Theorem 3 satisfies the same condition as in Lemma 2. Comparing
Theorem 3 with Theorem 1, we see that the uniform convergence condition (31) is naturally replaced
by the uniform convergence of the log-likelihood ratio evaluated at the null quantile. Using the fact
that $\frac{z}{1+z^2} \le \frac{\bar{\Phi}(z)}{\varphi(z)} \le \frac{1}{z}$ for all $z > 0$ [1, 7.1.13], which implies that

$$\bar{\Phi}(z) = \frac{\varphi(z)}{z}(1 + o(1)) \tag{44}$$

uniformly as $z \to \infty$, we can recover Theorem 1 from Theorem 3 by setting $\gamma(s) = \alpha(-\sqrt{s}) \vee \alpha(\sqrt{s})$.

3.3 Decomposition of the alternative 

The results in Theorem 1 and Theorem 3 are obtained under the assumption that the non-null
effect $G_n$ is absolutely continuous with respect to the null distribution $Q_n$. Next we show that
no generality is lost by focusing our attention on this case. Using the Hahn-Lebesgue decomposition
[15, Theorem 1.6.3], we can write

$$G_n = (1 - \kappa_n) G_n' + \kappa_n \nu_n \tag{45}$$

for some $\kappa_n \in [0, 1]$, where $G_n' \ll Q_n$ and $\nu_n \perp Q_n$. Put

$$\epsilon_n' \triangleq \frac{\epsilon_n (1 - \kappa_n)}{1 - \epsilon_n \kappa_n} \quad \text{and} \quad Q_n' \triangleq (1 - \epsilon_n') Q_n + \epsilon_n' G_n', \tag{46}$$

which satisfies $Q_n' \ll Q_n$. Then $(1 - \epsilon_n) Q_n + \epsilon_n G_n = (1 - \epsilon_n \kappa_n) Q_n' + \epsilon_n \kappa_n \nu_n$. By Lemma 7,

$$H^2\big(Q_n, (1 - \epsilon_n) Q_n + \epsilon_n G_n\big) = \Theta\Big(\epsilon_n \kappa_n \vee H^2\big(Q_n, (1 - \epsilon_n') Q_n + \epsilon_n' G_n'\big)\Big). \tag{47}$$

Therefore the asymptotic Hellinger distance of the original problem is completely determined by
$\epsilon_n \kappa_n$ and the squared Hellinger distance $H^2(Q_n, (1 - \epsilon_n') Q_n + \epsilon_n' G_n')$, which is also of a sparse mixture
form, with $(\epsilon_n, G_n)$ replaced by $(\epsilon_n', G_n')$ given in (46). In particular, we note the following special
cases:

1. If $\epsilon_n \kappa_n = O(n^{-1})$, then $H^2(Q_n, (1 - \epsilon_n) Q_n + \epsilon_n G_n) = o(n^{-1})$ (resp. $\omega(n^{-1})$) if and only if
$H^2(Q_n, (1 - \epsilon_n') Q_n + \epsilon_n' G_n') = o(n^{-1})$ (resp. $\omega(n^{-1})$), which means that the detectability of the
original sparse mixture coincides with that of the new mixture.

2. If $\epsilon_n \kappa_n = \omega(n^{-1})$, then $H^2(Q_n, (1 - \epsilon_n) Q_n + \epsilon_n G_n) = \omega(n^{-1})$, which means that the original
sparse mixture can be detected reliably. In fact, a trivial optimal test is to reject the null
hypothesis if there exists one sample lying in the support of the singular component $\nu_n$.
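
The algebra behind the reparametrization can be checked on a finite alphabet, where the decomposition of $G_n$ into a part absolutely continuous with respect to $Q_n$ and a part singular to $Q_n$ amounts to splitting the mass according to whether the null pmf vanishes. The sketch below is an illustration under that finite-alphabet assumption (and assuming $0 < \kappa < 1$); all function names are hypothetical. It verifies that, with $\epsilon' = \epsilon(1-\kappa)/(1-\epsilon\kappa)$ and $Q' = (1-\epsilon')Q + \epsilon' G'$, the mixture $(1-\epsilon)Q + \epsilon G$ equals $(1-\epsilon\kappa)Q' + \epsilon\kappa\,\nu$.

```python
def decompose(g, q):
    """Split pmf g as (1 - kappa) * g_ac + kappa * nu, with g_ac << q and nu singular to q.

    Finite-alphabet sketch; assumes 0 < kappa < 1 so that both parts are well defined.
    """
    kappa = sum(gi for gi, qi in zip(g, q) if qi == 0.0)
    if not 0.0 < kappa < 1.0:
        raise ValueError("this sketch assumes 0 < kappa < 1")
    g_ac = [gi / (1.0 - kappa) if qi > 0.0 else 0.0 for gi, qi in zip(g, q)]
    nu = [gi / kappa if qi == 0.0 else 0.0 for gi, qi in zip(g, q)]
    return kappa, g_ac, nu


def mix(w, a, b):
    """Coordinate-wise mixture (1 - w) * a + w * b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def reparametrize(eps, q, g):
    """Return (kappa, eps', Q', nu) with eps' = eps (1 - kappa) / (1 - eps kappa)."""
    kappa, g_ac, nu = decompose(g, q)
    eps_prime = eps * (1.0 - kappa) / (1.0 - eps * kappa)
    q_prime = mix(eps_prime, q, g_ac)
    return kappa, eps_prime, q_prime, nu


if __name__ == "__main__":
    q = [0.5, 0.3, 0.2, 0.0]
    g = [0.1, 0.2, 0.3, 0.4]
    eps = 0.05
    kappa, eps_prime, q_prime, nu = reparametrize(eps, q, g)
    original = mix(eps, q, g)
    rebuilt = mix(eps * kappa, q_prime, nu)
    # Maximum coordinate-wise discrepancy between the two representations.
    print(max(abs(a - b) for a, b in zip(original, rebuilt)))
```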



4 Adaptive optimality of Higher Criticism tests 

As discussed in Section 2.1, the fundamental limit $\beta^*$ of testing sparse normal mixtures (11)
can be achieved by the likelihood ratio test. However, in general the likelihood ratio test requires 
the knowledge of the alternative distribution, which is typically not accessible in practice. To 
overcome this limitation, it is desirable to construct adaptive testing procedures to achieve the 
optimal performance simultaneously for a collection of alternatives. This problem is also known
as universal hypothesis testing. See, e.g., [19, 33, 32] and the references therein, for results on
discrete alphabets. The basic idea of adaptive procedures usually involves comparing the empirical 
distribution of the data to the null distribution, which is assumed to be known. 

For the problem of detecting sparse normal mixtures, it is especially relevant to construct 
adaptive procedures, since in practice the underlying sparsity level and the non-zero priors are 
usually unknown. Toward this end, Donoho and Jin [12] introduced an adaptive test based on 
Tukey's higher criticism statistic. For the special case of (2), i.e., $P_n = \delta_{\sqrt{2r\log n}}$, it is shown that
the higher criticism test achieves the optimal detection boundary (20) while being adaptive to the
unknown non-null parameters $(\beta, r)$. Following the generalization by Jager and Wellner [22] via
Rényi divergence, we next explain briefly the gist of the higher criticism test.

Given the data $Y_1, \ldots, Y_n$, denote the empirical CDF by

$$\mathbb{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i \le t\}}.$$

Similar to the Kolmogorov-Smirnov statistic [30, p. 91], which computes the $L_\infty$-distance
(maximal absolute difference) between the empirical CDF and the null CDF, the higher
criticism statistic is the maximal pointwise $\chi^2$-divergence between the null and the empirical CDF.
We first introduce a few auxiliary notations. Recall that the $\chi^2$-divergence between two probability
measures is defined as

$$\chi^2(P \| Q) \triangleq \int \left(\frac{dP}{dQ} - 1\right)^2 dQ.$$

In particular, the binary $\chi^2$-divergence function (i.e., the $\chi^2$-divergence between Bernoulli
distributions) is given by

$$\chi^2(\mathrm{Bern}(p) \| \mathrm{Bern}(q)) = \frac{(p - q)^2}{q(1 - q)},$$

where Bern($p$) denotes the Bernoulli distribution with bias $p$. The higher criticism statistic is
defined by

$$\mathrm{HC}_n \triangleq \sup_{t \in \mathbb{R}} \sqrt{n \chi^2\big(\mathrm{Bern}(\mathbb{F}_n(t)) \,\|\, \mathrm{Bern}(\Phi(t))\big)} \tag{48}$$
$$= \sqrt{n} \sup_{t \in \mathbb{R}} \frac{|\mathbb{F}_n(t) - \Phi(t)|}{\sqrt{\Phi(t) \bar{\Phi}(t)}}. \tag{49}$$

Based on the statistic (48), the higher criticism test declares $H_1$ if and only if

$$\mathrm{HC}_n > \sqrt{2(1 + \delta) \log \log n} \tag{50}$$

where $\delta > 0$ is an arbitrary fixed constant.
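
The statistic and test described above can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: it computes $\sqrt{n}\,\sup_t |\mathbb{F}_n(t) - \Phi(t)| / \sqrt{\Phi(t)\bar\Phi(t)}$ with the supremum over all real $t$ as written in (49), and compares it with the threshold in (50). The function names `higher_criticism` and `hc_test`, the default `delta = 0.1`, and the handling of floating-point degeneracies are assumptions made for the example.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def higher_criticism(samples):
    """HC_n = sqrt(n) * sup_t |F_n(t) - Phi(t)| / sqrt(Phi(t) * (1 - Phi(t))).

    F_n is piecewise constant and, for a fixed value of F_n, the signed ratio is
    monotone in Phi(t), so it suffices to check both one-sided limits of F_n at
    each order statistic.
    """
    n = len(samples)
    best = 0.0
    for i, y in enumerate(sorted(samples)):
        phi = _STD_NORMAL.cdf(y)
        scale = math.sqrt(phi * (1.0 - phi))
        if scale == 0.0:
            continue  # Phi(y) rounds to 0 or 1 in floating point; skip this point
        for f in (i / n, (i + 1) / n):  # left limit and value of F_n at y
            best = max(best, math.sqrt(n) * abs(f - phi) / scale)
    return best


def hc_test(samples, delta: float = 0.1) -> bool:
    """Declare H1 iff HC_n > sqrt(2 (1 + delta) log log n); requires n >= 3."""
    n = len(samples)
    if n < 3:
        raise ValueError("need n >= 3 so that log log n is positive")
    threshold = math.sqrt(2.0 * (1.0 + delta) * math.log(math.log(n)))
    return higher_criticism(samples) > threshold


if __name__ == "__main__":
    # An idealized null-like sample (standard normal quantiles) and a contaminated copy.
    base = [_STD_NORMAL.inv_cdf((i + 0.5) / 1000) for i in range(1000)]
    print(hc_test(base), hc_test(base + [6.0] * 5))
```

Because the supremum ranges over the whole real line, a handful of observations far in the tail can dominate the statistic; this is the mechanism by which the test picks up a very sparse but strong signal.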

The next result shows that the higher criticism test achieves the fundamental limit $\beta^*$
characterized by Theorem 1 while being adaptive to all sequences of distributions $\{G_n\}$ which satisfy the
regularity condition (31). This result generalizes the adaptivity of the higher criticism procedure
far beyond the original equal-signal-strength setup in [12] and the heteroscedastic extension in [6].

Theorem 4. Under the same assumption of Theorem 1, for any $\beta < \beta^*$, the sum of Type-I and
Type-II error probabilities of the higher criticism test (50) vanishes as $n \to \infty$.

5 Examples



In this section we particularize the general result in Theorem 1 to several interesting special 
cases to obtain explicit detection boundaries. 

5.1 Ingster-Donoho-Jin detection boundary 

We derive the classical detection boundary (20) from Theorem 1 for the equal-signal-strength
setup (2), which is a convolutional model with signal distribution

$$P_n = \delta_{\mu_n} \tag{51}$$

and $\mu_n$ in (4). The log-likelihood ratio is given by

$$\ell_n(y) = \log \frac{\varphi(y - \mu_n)}{\varphi(y)} = -\frac{\mu_n^2}{2} + \mu_n y = -r \log n + \sqrt{2 r \log n}\, y.$$

Plugging in $y = u\sqrt{2 \log n}$, we have $\ell_n(u\sqrt{2 \log n}) = -r \log n + 2u\sqrt{r} \log n$. Consequently, the
condition (31) is fulfilled uniformly in $u \in \mathbb{R}$ with

$$\alpha(u) = 2u\sqrt{r} - r. \tag{52}$$

Straightforward calculation yields that

$$\operatorname*{ess\,sup}_{u \in \mathbb{R}} \left\{ 2u\sqrt{r} - r - u^2 + \frac{u^2 \wedge 1}{2} \right\} = \begin{cases} r, & 0 < r \le \frac{1}{4}, \\ \frac{1}{2} - (1 - \sqrt{r})_+^2, & r > \frac{1}{4}. \end{cases} \tag{53}$$

Applying Theorem 1, we obtain the desired expression (20) for $\beta^*(r)$.

As a variation of (51), the symmetrized version

$$P_n = \frac{1}{2}(\delta_{\mu_n} + \delta_{-\mu_n}) \tag{54}$$

was considered in [21, Section 8.1.6], whose detection boundary is shown to be identical to (20).
Indeed, for a binary-valued signal distributed according to (54), we have

$$\ell_n(u\sqrt{2 \log n}) = -\frac{\mu_n^2}{2} + \log \cosh(\mu_n u \sqrt{2 \log n}) = -r \log n + \log\big(n^{2u\sqrt{r}} + n^{-2u\sqrt{r}}\big) - \log 2,$$

which gives rise to

$$\alpha(u) = 2|u|\sqrt{r} - r. \tag{55}$$

Comparing (55) with (52) and (53), we conclude that the detection boundary (20) still applies.
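
The limit (55) can be illustrated numerically. For the symmetrized signal (54) with $\mu_n = \sqrt{2r\log n}$, the normalized log-likelihood ratio is $\ell_n(u\sqrt{2\log n})/\log n = -r + \log\cosh(2u\sqrt r \log n)/\log n$. The sketch below evaluates this expression as a function of $\log n$ directly (so that astronomically large $n$ causes no overflow) and compares it with $\alpha(u) = 2|u|\sqrt r - r$. The function names and the choice $\log n = 10^6$ are assumptions made for the illustration.

```python
import math


def log_cosh(x: float) -> float:
    """Numerically stable log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log 2."""
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax)) - math.log(2.0)


def normalized_llr_symmetric(u: float, r: float, log_n: float) -> float:
    """l_n(u sqrt(2 log n)) / log n for P_n = (delta_{mu_n} + delta_{-mu_n}) / 2.

    Uses mu_n = sqrt(2 r log n), so that mu_n * u * sqrt(2 log n) = 2 u sqrt(r) log n.
    """
    return (-r * log_n + log_cosh(2.0 * u * math.sqrt(r) * log_n)) / log_n


def alpha_symmetric(u: float, r: float) -> float:
    """The limit (55): alpha(u) = 2 |u| sqrt(r) - r."""
    return 2.0 * abs(u) * math.sqrt(r) - r


if __name__ == "__main__":
    r, log_n = 0.3, 1.0e6
    for u in (-1.5, -0.3, 0.0, 0.7):
        print(u, normalized_llr_symmetric(u, r, log_n), alpha_symmetric(u, r))
```

The gap between the two quantities is at most $\log 2 / \log n$ for every $u$, which is the uniform convergence required by (31).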
5.2 Dilated signal distributions

Generalizing both the unary and binary signal distributions in Section 5.1, we consider $P_n$ that
is the distribution of the random variable

$$X_n = \mu_n X \tag{56}$$

where $\mu_n > 0$ is a sequence of positive numbers and $X$ is distributed according to a fixed distribution
$P$, parameterizing the shape of the signal. In other words, $P_n$ is the dilation of $P$ by $\mu_n$. We ask the
following question: By choosing the sequence $\mu_n$ and the random variable $X$, is it possible to have
detection boundaries which are shaped differently from the classical Ingster-Donoho-Jin detection
boundary?

It turns out that for $\mu_n = \sqrt{2 \log n}$, the answer to the above question is negative. As the
next result shows, the detection boundary is given by that of the classical setup rescaled by
the $L_\infty$-norm of $X$. Note that (51) and (54) correspond to $P = \delta_{\sqrt{r}}$ and $P = \frac{1}{2}(\delta_{\sqrt{r}} + \delta_{-\sqrt{r}})$,
respectively.

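Corollary 2. Consider the convolutional model $G_n = P_n * \mathcal{N}(0, 1)$, where $P_n$ is the distribution of
$\sqrt{2 \log n}\, X$. Then

$$\beta^* = \beta^*_{\rm IDJ}(\|X\|_\infty^2) = \begin{cases} \|X\|_\infty^2 + \frac{1}{2}, & 0 < \|X\|_\infty \le \frac{1}{2}, \\ 1 - (1 - \|X\|_\infty)_+^2, & \|X\|_\infty > \frac{1}{2}. \end{cases} \tag{57}$$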

Proof. Recall that /?i* D j( - ) denotes the Ingster-Donoho-Jin detection boundary defined in (20). Since 



the log- likelihood ratio is given by £ n (y) = E exp( — ^ + X n y) 



we have 



e n (uy/2\ogn) = logE 



n 



-X 2 +2uX 



ess sup {-X 2 + 2uX\ logn(l + o(l)), 

x 



(58) 



where we have applied Lemma 3 and the essential supremum in (58) is with respect to P, the 
distribution of X. Therefore a{u) = esssup^ {— X 2 + 2uX}. Applying Theorem 1 yields the 
existence of (3* , given by 



- + ess sup | ess sup {—X 2 + 2uX\ — u 2 + 
2 ueR [ x 



it A 1 



2 
1 

- + ess sup ess sup 
2 x 



-X 2 + 2uX - u 2 + 



u 2 A 1 



esssup^* DJ (X ) 
x 



ID.] ^ \\X II oo)' 



where (59) follows from the facts that /3j DJ (-) is increasing and that \\X\ 



(59) 

ess sup | X\. □ 
Remark 3. Corollary 2 tightens the bounds given at the end of [6, Section 6.1] based on the
interval containing the signal support. From (57) we see that the detection boundary coincides
with the classical case with $\sqrt{r}$ replaced by the $L_\infty$-norm of $X$. Therefore, as far as the detection
boundary is concerned, only the support of $X$ matters and the detection problem is driven by the
maximal signal strength. In particular, for $\|X\|_\infty \ge 1$ or non-compactly supported $X$, we obtain
the degenerate case $\beta^* = 1$ (see also Remark 1 about the strong-signal regime). However, it is
possible that the density of $X$ plays a role in finer asymptotics of the testing problem, e.g., the
convergence rate of the error probability and the limiting distribution of the log-likelihood ratio at
the detection boundary.
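The formula in Corollary 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the Ingster-Donoho-Jin boundary in the form $\frac{1}{2} + r$ for $r \le \frac{1}{4}$ and $1 - (1 - \sqrt{r})_+^2$ otherwise, restricts to a finitely supported $X$, and replaces the essential supremum over $u$ by a maximum over a grid. The function names are hypothetical.

```python
import math


def beta_idj(r: float) -> float:
    """Ingster-Donoho-Jin boundary, assumed form: 1/2 + r for r <= 1/4,
    and 1 - (1 - sqrt(r))_+^2 for r > 1/4."""
    if r <= 0.25:
        return 0.5 + r
    return 1.0 - max(1.0 - math.sqrt(r), 0.0) ** 2


def beta_star_bounded_signal(x_sup: float) -> float:
    """Closed form of Corollary 2: only the sup-norm of X enters."""
    return beta_idj(x_sup ** 2)


def beta_star_grid(support, grid: int = 4001, u_max: float = 3.0) -> float:
    """Brute-force value of 1/2 + sup_u { max_x(-x^2 + 2ux) - u^2 + min(u^2, 1)/2 }
    for a finitely supported X, with u restricted to a grid on [-u_max, u_max]."""
    best = -math.inf
    for i in range(grid):
        u = -u_max + 2.0 * u_max * i / (grid - 1)
        alpha = max(-x * x + 2.0 * u * x for x in support)
        best = max(best, alpha - u * u + min(u * u, 1.0) / 2.0)
    return 0.5 + best
```

For a support such as `[0.2, -0.7, 0.4]`, the grid value agrees with `beta_star_bounded_signal(0.7)` up to discretization error, which reflects the remark that only the maximal signal strength matters.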



One of the consequences of Corollary 2 is the following: as long as $\mu_n = \sqrt{2 \log n}$, non-compactly
supported $X$ results in the degenerate case of $\beta^* = 1$, since the signal is too strong to go undetected.
However, this conclusion need not be true if $\mu_n$ behaves differently. We conclude this subsection by
constructing a family of distributions of $X$ with unbounded support and an appropriately chosen
sequence $\{\mu_n\}$, such that the detection boundary is non-degenerate: Let $X$ be distributed according
to the following generalized Gaussian (Subbotin) distribution $P_\tau$ [31] with shape parameter $\tau > 0$,
whose density is

$$p_\tau(x) = \frac{\tau}{2\Gamma(1/\tau)} \exp(-|x|^\tau). \tag{60}$$

Put $\mu_n = \sqrt{2r}\,(\log n)^{\frac{1}{2} - \frac{1}{\tau}}$. Then the density of $X_n$ is given by $v_n(x) = \frac{1}{\mu_n} p_\tau\left(\frac{x}{\mu_n}\right)$. Hence

$$v_n(t\sqrt{2 \log n}) = \frac{\tau}{2\Gamma(1/\tau)\mu_n}\, n^{-|t|^\tau r^{-\tau/2}},$$

which satisfies the condition (36) with $f(t) = |t|^\tau r^{-\tau/2}$. Applying Corollary 1, we obtain the detection
boundary $\beta^*$ (a two-dimensional surface parametrized by $(r, \tau)$ shown in Fig. 3) as follows

$$\beta^* = \sup_{t \in \mathbb{R}} \left\{ \beta^*_{\rm IDJ}(t^2) - |t|^\tau r^{-\tau/2} \right\} = \sup_{z \ge 0} \left\{ \beta^*_{\rm IDJ}(r z^2) - z^\tau \right\} \tag{61}$$

where $\beta^*_{\rm IDJ}$ is the Ingster-Donoho-Jin detection boundary defined in (20).

Equation (61) can be further simplified for the following special cases.

• $\tau = 1$ (Laplace): Plugging (20) into (61), straightforward computation yields

$$\beta^* = \begin{cases} \left(1 - \frac{1}{2\sqrt{r}}\right)^2, & r > \frac{3}{2} + \sqrt{2} \\ \frac{1}{2}, & r \le \frac{3}{2} + \sqrt{2}. \end{cases}$$

• $\tau = 2$ (Gaussian): In this case we have $X \sim \mathcal{N}(0, \frac{1}{2})$ and $X_n \sim \mathcal{N}(0, r)$. This is a special case
of the heteroscedastic case in [6], which will be discussed in detail in Section 5.3. Simplifying
(61) we obtain

$$\beta^* = \frac{1}{2} \vee \frac{r}{1 + r},$$

which coincides with (67).
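The variational formula (61) is easy to evaluate numerically, which gives a quick sanity check on the special cases. The following is a minimal sketch, assuming the form of the Ingster-Donoho-Jin boundary stated in the docstring and approximating the supremum over $z \ge 0$ by a maximum over a finite grid; the function names and the grid parameters are illustrative choices, not part of the original text.

```python
import math


def beta_idj(r: float) -> float:
    """Ingster-Donoho-Jin boundary, assumed form: 1/2 + r for r <= 1/4,
    and 1 - (1 - sqrt(r))_+^2 for r > 1/4."""
    if r <= 0.25:
        return 0.5 + r
    return 1.0 - max(1.0 - math.sqrt(r), 0.0) ** 2


def beta_star_subbotin(r: float, tau: float, z_max: float = 10.0, grid: int = 200001) -> float:
    """Grid approximation of sup_{z >= 0} { beta_IDJ(r z^2) - z^tau } from (61)."""
    best = -math.inf
    for i in range(grid):
        z = z_max * i / (grid - 1)
        best = max(best, beta_idj(r * z * z) - z ** tau)
    return best
```

With $\tau = 2$ the grid value should be close to $\frac{1}{2} \vee \frac{r}{1+r}$; for example `beta_star_subbotin(3.0, 2.0)` is approximately 0.75. The point $z = 0$ always contributes the value $\frac{1}{2}$, so the boundary never falls below $\frac{1}{2}$.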
Figure 3: Detection boundary $\beta^*$ given by (61) as a function of $r$ for various values of $\tau$.



5.3 Heteroscedastic normal mixture 

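The heteroscedastic normal mixtures considered in (8) correspond to

$$G_n = \mathcal{N}(\mu_n, \sigma^2)$$

with $\mu_n$ given in (4) and $\sigma^2 > 0$. In particular, if $\sigma^2 > 1$, $G_n$ is given by the convolution $G_n = \Phi * P_n$,
where the Gaussian component $P_n = \mathcal{N}(\mu_n, \sigma^2 - 1)$ models the variation in the signal amplitude.
For any $u \in \mathbb{R}$,

$$\ell_n(u\sqrt{2 \log n}) = \log \frac{\frac{1}{\sigma} \varphi\left(\frac{u\sqrt{2 \log n} - \mu_n}{\sigma}\right)}{\varphi(u\sqrt{2 \log n})} = \alpha(u) \log n - \log \sigma, \tag{62}$$

where

$$\alpha(u) = u^2 - \frac{(u - \sqrt{r})^2}{\sigma^2}.$$

Similar to the calculation in Section 5.1, we have$^1$

$$\sup_{0 \le s \le 1} \left\{ \alpha(s) - \frac{s^2}{2} \right\} = \begin{cases} \frac{r}{2 - \sigma^2}, & 2\sqrt{r} + \sigma^2 \le 2 \\ \frac{1}{2} - \frac{(1 - \sqrt{r})^2}{\sigma^2}, & 2\sqrt{r} + \sigma^2 > 2 \end{cases} \tag{63}$$

and

$$\sup_{s \ge 1} \left\{ \alpha(s) - s^2 \right\} = -\frac{(1 - \sqrt{r})_+^2}{\sigma^2}. \tag{64}$$

Note that

$$\frac{r}{2 - \sigma^2} - \left( \frac{1}{2} - \frac{(1 - \sqrt{r})^2}{\sigma^2} \right) = \frac{(2\sqrt{r} + \sigma^2 - 2)^2}{2\sigma^2(2 - \sigma^2)} \ge 0$$

if $2\sqrt{r} + \sigma^2 < 2$. Assembling (63) - (64) and applying Theorem 1, we have

$$\begin{aligned}
\beta^*(r, \sigma^2) &= \frac{1}{2} + \sup_{0 \le s \le 1} \left\{ \alpha(s) - \frac{s^2}{2} \right\} \vee \left( \frac{1}{2} - \frac{(1 - \sqrt{r})_+^2}{\sigma^2} \right) && (65) \\
&= \begin{cases} \frac{1}{2} + \frac{r}{2 - \sigma^2}, & 2\sqrt{r} + \sigma^2 \le 2 \\ 1 - \frac{(1 - \sqrt{r})_+^2}{\sigma^2}, & 2\sqrt{r} + \sigma^2 > 2. \end{cases} && (66)
\end{aligned}$$

$^1$ In the first case of (63) it is understood that $\frac{0}{0} = 0$.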
Solving the equation $\beta^*(r, \sigma^2) = \beta$ in $r$ yields the equivalent detection boundary (9) in terms of $r$.
In the special case of $r = 0$, where the signal is distributed according to $P_n = \mathcal{N}(0, \tau^2)$, we have

$$\beta^*(0, 1 + \tau^2) = \frac{\tau^2}{1 + \tau^2} \vee \frac{1}{2}. \tag{67}$$

Therefore, as long as the signal variance exceeds that of the noise, reliable detection is possible in
the very sparse regime $\beta > \frac{1}{2}$, even if the average signal strength does not tend to infinity.
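As a sanity check on the heteroscedastic boundary, one can compare the piecewise closed form with a direct numerical maximization of the expression in Theorem 1, using $\alpha(u) = u^2 - (u - \sqrt{r})^2/\sigma^2$. The sketch below is illustrative only: the function names are hypothetical, the supremum is approximated on a finite grid of $u$, and the convention $\frac{0}{0} = 0$ is handled explicitly.

```python
import math


def beta_star_hetero(r: float, sigma2: float) -> float:
    """Piecewise closed form for the heteroscedastic normal mixture."""
    if 2.0 * math.sqrt(r) + sigma2 <= 2.0:
        if r == 0.0:
            return 0.5  # convention 0/0 = 0 in the first case
        return 0.5 + r / (2.0 - sigma2)
    return 1.0 - max(1.0 - math.sqrt(r), 0.0) ** 2 / sigma2


def beta_star_theorem1(r: float, sigma2: float, u_max: float = 5.0, grid: int = 100001) -> float:
    """Grid approximation of 1/2 + sup_u { alpha(u) - u^2 + min(u^2, 1)/2 }."""
    best = -math.inf
    for i in range(grid):
        u = -u_max + 2.0 * u_max * i / (grid - 1)
        alpha = u * u - (u - math.sqrt(r)) ** 2 / sigma2
        best = max(best, alpha - u * u + min(u * u, 1.0) / 2.0)
    return 0.5 + best
```

For instance, with $r = 0$ and $\sigma^2 = 5$ both functions return approximately $0.8 = \tau^2/(1+\tau^2)$ with $\tau^2 = 4$, in line with (67).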

5.4 Non-Gaussian mixtures 

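We consider the detection boundary of the following generalized Gaussian location mixture
which was studied in [12, Section 5.2]:

$$H_0^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} P_\tau \quad \text{versus} \quad H_1^{(n)}: Y_i \overset{\text{i.i.d.}}{\sim} (1 - n^{-\beta}) P_\tau + n^{-\beta} P_\tau(\cdot - \mu_n) \tag{68}$$

where $P_\tau$ is defined in (60), and $\mu_n = (r \log n)^{1/\tau}$. Since $z(1 - n^{-s}) = -z(n^{-s}) = (s \log n)^{1/\tau}(1 + o(1))$
uniformly in $s$, (42) is fulfilled with $\gamma(s) = s - |s^{1/\tau} - r^{1/\tau}|^\tau$. Applying Theorem 3, we have

$$\begin{aligned}
\beta^*(r) &= \frac{1}{2} + 0 \vee \sup_{s \ge 0} \left\{ -|s^{1/\tau} - r^{1/\tau}|^\tau + \frac{s \wedge 1}{2} \right\} = \frac{1}{2} + 0 \vee \sup_{u \ge 0} \left\{ -|u - r^{1/\tau}|^\tau + \frac{u^\tau \wedge 1}{2} \right\} \\
&= \begin{cases}
1, & r > 1 \\
\frac{1 + r}{2}, & \tau \le 1,\ r \le 1 \\
\frac{1}{2} + \frac{r}{2\left(1 - 2^{-\frac{1}{\tau - 1}}\right)^{\tau - 1}}, & \tau > 1,\ r < \left(1 - 2^{-\frac{1}{\tau - 1}}\right)^\tau \\
1 - \left(1 - r^{1/\tau}\right)^\tau, & \tau > 1,\ \left(1 - 2^{-\frac{1}{\tau - 1}}\right)^\tau \le r \le 1.
\end{cases}
\end{aligned} \tag{69}$$

It is easy to verify that (69) agrees with the results in [12, Theorem 5.1]. Similarly, the detection
boundary for the exponential-$\chi^2$ mixture in [12, Theorem 1.7] can also be derived from Theorem 3.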

6 Discussions 

We conclude the paper with a few discussions and open problems. 

6.1 Moderately sparse regime $0 \le \beta \le \frac{1}{2}$

Our main results in Section 3 only concern the very sparse regime $\frac{1}{2} < \beta < 1$. This is because
under the assumption in Theorem 1 that $\alpha > 0$ on a set of positive Lebesgue measure, we always
have $\beta^* \ge \frac{1}{2}$. One of the major distinctions between the very sparse and moderately sparse
regimes is the effect of symmetrization. To illustrate this point, consider the sparse normal mixture
model (11). Given any $G_n$, replacing it by its symmetrized version $\tilde{G}_n(dx) = \frac{G_n(dx) + G_n(-dx)}{2}$
always increases the difficulty of testing. This follows from the inequality $H^2(\tilde{G}_n, \Phi) \le H^2(G_n, \Phi)$,
a consequence of the convexity of the squared Hellinger distance and the symmetry of $\Phi$. A
natural question is: Does symmetrization always have an impact on the detection boundary? In
the very sparse regime, it turns out that under the regularity conditions imposed in Theorem 1,
symmetrization does not affect the fundamental limit $\beta^*$, because both $G_n$ and $\tilde{G}_n$ give rise to
the same function $\alpha$. It is unclear whether $\underline{\beta}^*$ and $\overline{\beta}^*$ remain unchanged if an arbitrary sequence
$\{G_n\}$ is symmetrized. However, in the moderately sparse regime, an asymmetric non-null effect
can be much more detectable than its symmetrized version. For instance, direct calculation (see
for example [6, Section 2.2]) shows that $\beta^*(r) = \frac{1}{2} - r$ for $G_n = \delta_{n^{-r}}$, but $\beta^*(r) = \frac{1}{2} - 2r$ for
$G_n = \frac{1}{2}(\delta_{n^{-r}} + \delta_{-n^{-r}})$.

Moreover, unlike in the very sparse regime, moment-based tests can be powerful in the moderately
sparse regime, which guarantee that $\beta^* \ge \frac{1}{2}$. For instance, in the above examples $G_n = \delta_{n^{-r}}$
or $G_n = \frac{1}{2}(\delta_{n^{-r}} + \delta_{-n^{-r}})$, the detection boundary can be obtained by thresholding the sample mean
or sample variance respectively. More sophisticated moment-based tests such as the excess kurtosis
tests have been studied in the context of sparse mixtures [24]. It is unclear whether they are always
optimal when $\beta < \frac{1}{2}$.

6.2 Adaptive optimality of higher criticism tests 

While Theorem 4 establishes the adaptive optimality of the higher criticism test in the very
sparse regime $\beta > \frac{1}{2}$, the optimality of the higher criticism test in the moderately sparse case
$\beta < \frac{1}{2}$ remains an open question. Note that in the classical setup (2), it has been shown [6] that
the higher criticism test achieves adaptive optimality for $\beta \in [0, \frac{1}{2}]$ and $\mu_n = n^{-r}$. In this case
since $\mu_n = o(1)$, we have $\alpha \equiv 0$ and Theorem 1 thus does not apply. It is possible to obtain a
counterpart of Theorem 1 and an analogous expression for $\beta^*$ for the moderately sparse regime
if one assumes a similar uniform approximation property of the log-likelihood ratio, for example,
$\ell_n(u\sqrt{\log n}) = n^{-\alpha(u) + o(1)}$ for some function $\alpha$. Another interesting problem is to investigate the
optimality of procedures introduced in [22] based on Rényi divergence under the same setup of
Theorem 4.



7 Proofs 

7.1 Auxiliary results 

Laplace's method (see, e.g., [13, Section 2.4]) is a technique for analyzing the asymptotics of
integrals of the form $\int \exp(M f)\, d\nu$ when $M$ is large. The proof of Theorem 1 uses the following
first-order version of Laplace's method. Since we are only interested in the exponent (i.e., the
leading term), we do not use saddle-point approximation as in the usual Laplace's method and impose
no regularity conditions on the function $f$ except for the finiteness of the integral. Moreover, the
exponent only depends on the essential supremum of $f$ with respect to $\nu$, which is invariant if $f$ is
modified on a $\nu$-negligible set.

Lemma 3. Let $(\mathcal{X}, \mathcal{F}, \nu)$ be a measure space. Let $F : \mathcal{X} \times \mathbb{R}_+ \to \mathbb{R}_+$ be measurable. Assume that

$$\lim_{M \to \infty} \frac{\log F(x, M)}{M} = f(x) \tag{70}$$

holds uniformly in $x \in \mathcal{X}$ for some measurable $f : \mathcal{X} \to \mathbb{R}$. If $\int_{\mathcal{X}} \exp(M_0 f)\, d\nu < \infty$ for some
$M_0 > 0$, then

$$\lim_{M \to \infty} \frac{1}{M} \log \int_{\mathcal{X}} F(x, M)\, d\nu = \operatorname*{ess\,sup}_{x \in \mathcal{X}} f(x). \tag{71}$$

Proof. First we deal with the case of $\operatorname{ess\,sup} f = \infty$, which implies that $\nu(\{f > a\}) > 0$ for all
$a > 0$. Moreover, by the Chernoff bound, $\nu(\{f > a\}) \le \exp(-M_0 a) \int \exp(M_0 f)\, d\nu < \infty$. By (70), for
any $\epsilon > 0$, there exists $K > M_0$ such that

$$\exp(M(f(x) - \epsilon)) \le F(x, M) \le \exp(M(f(x) + \epsilon)) \tag{72}$$

for all $x \in \mathcal{X}$ and $M \ge K$. Therefore, $\int F(x, M)\, d\nu \ge \exp(-M\epsilon) \int \exp(M f)\, d\nu \ge \exp(M(a - \epsilon))\, \nu(\{f > a\})$
for any $M \ge K$ and $a > 0$. Then $\liminf_{M \to \infty} \frac{1}{M} \log \int F(x, M)\, d\nu \ge a - \epsilon$. By the
arbitrariness of $a$ and $\epsilon$, we have $\lim_{M \to \infty} \frac{1}{M} \log \int F(x, M)\, d\nu = \infty$.

Next we assume that $\operatorname{ess\,sup} f < \infty$. By replacing $f$ with $f - \operatorname{ess\,sup} f$, we can assume that
$\operatorname{ess\,sup} f = 0$ without loss of any generality. Then $f \le 0$ $\nu$-a.e. Hence, by (72),

$$\int F(x, M)\, d\nu \le \int \exp(M(f + \epsilon))\, d\nu \le \exp(M\epsilon) \int \exp(M_0 f)\, d\nu < \infty$$

holds for all $M \ge K$. By the arbitrariness of $\epsilon$, we have

$$\limsup_{M \to \infty} \frac{1}{M} \log \int F(x, M)\, d\nu \le 0.$$

For the lower bound, note that, by the definition of $\operatorname{ess\,sup} f = 0$, $\nu(\{f > -\delta\}) > 0$ for all $\delta > 0$.
Therefore, by (72), we have

$$\int F(x, M)\, d\nu \ge \exp(-M\epsilon) \int \exp(M f)\, d\nu \ge \exp(-M(\delta + \epsilon))\, \nu(\{f > -\delta\})$$

for any $M \ge K$ and $\delta > 0$. First sending $M \to \infty$, then $\delta \downarrow 0$ and $\epsilon \downarrow 0$, we have

$$\liminf_{M \to \infty} \frac{1}{M} \log \int F(x, M)\, d\nu \ge 0,$$

completing the proof of (71). □
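The conclusion of Lemma 3 can be illustrated numerically in the simplest case $F(x, M) = \exp(M f(x))$ with $\nu$ the Lebesgue measure on an interval. The sketch below is an illustration under these assumptions, not a verification of the lemma: it approximates the integral by a midpoint rule (with a log-sum-exp shift for numerical stability) and reports $\frac{1}{M} \log \int \exp(M f)$, which should approach the supremum of $f$ as $M$ grows. The function name and the test function are hypothetical.

```python
import math


def laplace_exponent(f, M: float, a: float = 0.0, b: float = 1.0, grid: int = 20001) -> float:
    """(1/M) * log of a midpoint-rule approximation to the integral of exp(M f) over [a, b]."""
    h = (b - a) / grid
    vals = [M * f(a + (k + 0.5) * h) for k in range(grid)]
    shift = max(vals)  # log-sum-exp shift to avoid overflow or underflow
    log_integral = shift + math.log(sum(math.exp(v - shift) for v in vals) * h)
    return log_integral / M


def example_f(x: float) -> float:
    """Test function with supremum 0 on [0, 1], attained at x = 0.3."""
    return -(x - 0.3) ** 2
```

For `example_f`, the value at $M = 10$ is noticeably below zero, while at $M = 1000$ it is within about $0.003$ of the supremum $0$; the gap decays like $\frac{\log M}{M}$, which is invisible at the level of the exponent.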

The following lemma is useful for analyzing the asymptotics of Hellinger distance: 

Lemma 4. 1. For any $b > 0$, the function $s \mapsto (\sqrt{1 + b(s - 1)} - 1)^2$ is strictly convex on $\mathbb{R}_+$
and strictly decreasing and increasing on $[0, 1]$ and $[1, \infty)$, respectively.

2. For any $t \ge 0$,

$$(\sqrt{2} - 1)^2\, t \wedge t^2 \le (\sqrt{1 + t} - 1)^2 \le t \wedge t^2. \tag{73}$$

Proof. 1. Since $t \mapsto \sqrt{1 + t}$ is strictly concave, $s \mapsto (\sqrt{1 + b(s - 1)} - 1)^2 = 2 + b(s - 1) - 2\sqrt{1 + b(s - 1)}$
is strictly convex. Solving for the stationary point yields the minimum at $s = 1$.

2. First we consider $t \ge 1$. Since $t \mapsto (\sqrt{1 + t} - 1)^2 = 2 + t - 2\sqrt{1 + t}$ is convex and vanishes at $t = 0$, $t \mapsto \frac{(\sqrt{1 + t} - 1)^2}{t}$ is
increasing. Consequently, we have $(\sqrt{2} - 1)^2 \le \frac{(\sqrt{1 + t} - 1)^2}{t} \le 1$ for all $t \in [1, \infty)$.

Next we consider $0 \le t \le 1$. By the concavity of $t \mapsto \sqrt{1 + t}$, $t \mapsto \frac{\sqrt{1 + t} - 1}{t}$ is decreasing.
Hence $\sqrt{2} - 1 \le \frac{\sqrt{1 + t} - 1}{t} \le \frac{1}{2}$ for all $t \in [0, 1]$. Assembling the above two cases yields (73). □
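The two-sided bound (73) is elementary, and a quick numerical spot check is a useful guard against sign or constant errors. The following sketch (with a hypothetical function name, and a small floating-point tolerance as an assumption) evaluates both inequalities on a list of nonnegative values of $t$.

```python
import math


def check_inequality_73(ts, tol: float = 1e-12) -> bool:
    """Return True if (sqrt(2)-1)^2 * min(t, t^2) <= (sqrt(1+t)-1)^2 <= min(t, t^2)
    holds (up to a floating-point tolerance) for every t in ts."""
    c = (math.sqrt(2.0) - 1.0) ** 2
    for t in ts:
        middle = (math.sqrt(1.0 + t) - 1.0) ** 2
        upper = min(t, t * t)
        lower = c * upper
        if not (lower <= middle + tol and middle <= upper + tol):
            return False
    return True
```

The lower bound is attained at $t = 1$, where both sides equal $(\sqrt{2} - 1)^2$, so the constant cannot be improved.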

The following lemmas are useful in proving Theorem 4: 

Lemma 5. Let $f : \mathbb{R} \to \mathbb{R}$ be measurable and $\mu$ be any measure on $\mathbb{R}$. The function $g$ defined by

$$g(s) = \operatorname*{ess\,sup}_{q \ge s} f(q)$$

is decreasing and lower-semicontinuous, where the essential supremum is with respect to $\mu$.

Proof. The monotonicity is obvious. We only prove lower-semicontinuity, which, in particular, also
implies right-continuity. Let $s_n \to s$. By definition of the essential supremum, for any $\delta > 0$, we have
$\mu\{q \ge s : f(q) > g(s) - \delta\} > 0$. By the dominated convergence theorem, $\mu\{q \ge s_n : f(q) > g(s) - \delta\} \to \mu\{q \ge s : f(q) > g(s) - \delta\}$.
Hence there exists $N$ such that $\mu\{q \ge s_n : f(q) > g(s) - \delta\} > 0$
for all $n \ge N$, which implies that $g(s_n) \ge g(s) - \delta$ for all $n \ge N$. By the arbitrariness of $\delta$, we have
$\liminf_{n \to \infty} g(s_n) \ge g(s)$, completing the proof of the lower semi-continuity. □

Lemma 6. Under the conditions of Theorem 1, for any $u \ge 0$,

$$\lim_{n \to \infty} \frac{\log\left( (1 - F_n(u\sqrt{2 \log n})) \wedge F_n(-u\sqrt{2 \log n}) \right)}{\log n} = v(u) \triangleq \operatorname*{ess\,sup}_{q \ge u} \{\alpha(q) - q^2\}.$$

Proof. First assume that $u \ge 0$. Then

$$1 - F_n(u\sqrt{2 \log n}) = \int_{y \ge u\sqrt{2 \log n}} \exp(\ell_n(y)) \varphi(y)\, dy = \sqrt{\frac{\log n}{\pi}} \int_{q \ge u} \exp(\ell_n(q\sqrt{2 \log n}))\, n^{-q^2}\, dq = n^{v(u) + o(1)},$$

where the last equality follows from Lemma 3. The proof for $u < 0$ is completely analogous. □
7.2 Proofs in Section 3 

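Proof of Theorem 1. Let $W \sim \mathcal{N}(0, 1)$. Put $\nu_n = (1 - n^{-\beta}) \mathcal{N}(0, 1) + n^{-\beta} G_n$. Since $G_n \ll \Phi$ by
assumption, we also have $\nu_n \ll \mathcal{N}(0, 1)$. Denote the likelihood ratio by $L_n = \frac{dG_n}{d\Phi} = \exp(\ell_n)$. Then

$$\frac{d\nu_n}{d\Phi} = 1 + n^{-\beta}(\exp(\ell_n) - 1). \tag{74}$$

(Direct part) Recall the notation $\beta^\sharp$ defined in (28), which can be equivalently written as

$$\beta^\sharp = \frac{1}{2} + \operatorname*{ess\,sup}_{u \in \mathbb{R}} \left\{ \alpha_+(u) - u^2 + \frac{u^2 \wedge 1}{2} \right\}.$$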
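Assuming (29), we show that $\underline{\beta}^* \ge \beta^\sharp$ by lower bounding the Hellinger distance. To this end, fix an
arbitrary $\delta > 0$. Let $\beta = \beta^\sharp - 2\delta$. Denote by $\lambda$ the Lebesgue measure on the real line. By definition
of the essential supremum, $\lambda\{u : \alpha_+(u) - u^2 + \frac{u^2 \wedge 1}{2} \ge \beta + \delta - \frac{1}{2}\} > 0$. Since $-u^2 + \frac{u^2 \wedge 1}{2} \le 0$ for all $u$
and $\beta + \delta - \frac{1}{2} \ge -\delta$, we must have $\lambda\{u : \alpha(u) - u^2 + \frac{u^2 \wedge 1}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(u) > 0\} > 0$. Since, by
assumption, $\lambda\{u : \alpha(u) > 0\} > 0$, there exists $0 < \epsilon < \frac{\delta}{2}$, such that

$$\lambda\left\{u : \alpha(u) - u^2 + \frac{u^2 \wedge 1}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(u) \ge 2\epsilon \right\} > 0. \tag{75}$$

By assumption (29), there exists $N_\epsilon \in \mathbb{N}$ such that

$$\ell_n(u\sqrt{2 \log n}) \ge (\alpha(u) - \epsilon) \log n \tag{76}$$

holds for all $u \in \mathbb{R}$ and all $n \ge N_\epsilon$. From (75), we have either

$$\lambda\left\{u : |u| \le 1,\ \alpha(u) - \frac{u^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(u) \ge 2\epsilon \right\} > 0 \tag{77}$$

or

$$\lambda\left\{u : |u| \ge 1,\ \alpha(u) - u^2 \ge \beta + \delta - 1,\ \alpha(u) \ge 2\epsilon \right\} > 0. \tag{78}$$

Next we discuss these two cases separately: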
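Case I: Assume (77). Let

$$U \triangleq \frac{W}{\sqrt{2 \log n}} \sim \mathcal{N}\left(0, \frac{1}{2 \log n}\right). \tag{79}$$

The square Hellinger distance can be lower bounded as follows:

$$\begin{aligned}
H_n^2(\beta) &= H^2(\Phi, \nu_n) = \int \left( \sqrt{\frac{d\nu_n}{d\Phi}} - 1 \right)^2 d\Phi \\
&= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(U\sqrt{2 \log n})) - 1)} - 1 \right)^2 \right] && (80) \\
&\ge \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(U\sqrt{2 \log n})) - 1)} - 1 \right)^2 \mathbf{1}_{\left\{|U| \le 1,\ \alpha(U) - \frac{U^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(U) \ge 2\epsilon\right\}} \right] \\
&\ge \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(n^{\alpha(U) - \epsilon} - 1)} - 1 \right)^2 \mathbf{1}_{\left\{|U| \le 1,\ \alpha(U) - \frac{U^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(U) \ge 2\epsilon\right\}} \right] && (81) \\
&\ge \frac{(\sqrt{2} - 1)^2}{4}\, \mathbb{E}\left[ n^{(\alpha(U) - \epsilon - \beta) \wedge 2(\alpha(U) - \epsilon - \beta)}\, \mathbf{1}_{\left\{|U| \le 1,\ \alpha(U) - \frac{U^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(U) \ge 2\epsilon\right\}} \right] && (82) \\
&= \frac{(\sqrt{2} - 1)^2 \sqrt{\log n}}{4\sqrt{\pi}} \int n^{(\alpha(u) - \epsilon - \beta) \wedge 2(\alpha(u) - \epsilon - \beta) - u^2}\, \mathbf{1}_{\left\{|u| \le 1,\ \alpha(u) - \frac{u^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(u) \ge 2\epsilon\right\}}\, du && (83) \\
&\ge \frac{(\sqrt{2} - 1)^2 \sqrt{\log n}}{4\sqrt{\pi}}\, \lambda\left\{ |u| \le 1,\ \alpha(u) - \frac{u^2}{2} \ge \beta + \delta - \frac{1}{2},\ \alpha(u) \ge 2\epsilon \right\} n^{-1 + \frac{\delta}{2}}, && (84)
\end{aligned}$$

where

• (80): By (74).

• (81): By Lemma 4.1 and (76).

• (82): Without loss of generality, we can assume that $n^\epsilon \ge 2$. Then applying the lower bound
in Lemma 4.2 yields the desired inequality.

• (83): We used the density of $U$ defined in (79).

• (84): Given that $|u| \le 1$ and $\alpha(u) - \frac{u^2}{2} \ge \beta + \delta - \frac{1}{2}$, we have both $\alpha(u) - \epsilon - \beta - u^2 \ge -\frac{1 + u^2}{2} + \delta - \epsilon \ge -1 + \frac{\delta}{2}$
and $2\alpha(u) - 2\epsilon - 2\beta - u^2 \ge -1 + 2\delta - 2\epsilon \ge -1 + \delta$.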

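Case II: Now we assume (78). Following analogous steps as in the previous case, we have

$$\begin{aligned}
H_n^2(\beta) &\ge \frac{(\sqrt{2} - 1)^2 \sqrt{\log n}}{4\sqrt{\pi}} \int n^{(\alpha(u) - \epsilon - \beta) \wedge 2(\alpha(u) - \epsilon - \beta) - u^2}\, \mathbf{1}_{\left\{|u| \ge 1,\ \alpha(u) - u^2 \ge \beta + \delta - 1,\ \alpha(u) \ge 2\epsilon\right\}}\, du \\
&\ge \frac{(\sqrt{2} - 1)^2 \sqrt{\log n}}{4\sqrt{\pi}}\, \lambda\left\{ |u| \ge 1,\ \alpha(u) - u^2 \ge \beta + \delta - 1,\ \alpha(u) \ge 2\epsilon \right\} n^{-1 + \frac{\delta}{2}},
\end{aligned} \tag{85}$$

where (85) is due to the following: Since $|u| \ge 1$ and $\alpha(u) - u^2 \ge \beta + \delta - 1$, we have both
$\alpha(u) - \epsilon - \beta - u^2 \ge \delta - \epsilon - 1 \ge -1 + \frac{\delta}{2}$ and $2\alpha(u) - 2\epsilon - 2\beta - u^2 \ge u^2 - 2 + 2\delta - 2\epsilon \ge -1 + \delta$.

Combining (84) and (85) we conclude that $H_n^2(\beta) = \omega(n^{-1})$. By the arbitrariness of $\delta > 0$ and
the alternative definition of $\underline{\beta}^*$ in (25), the proof of $\underline{\beta}^* \ge \beta^\sharp$ is completed.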
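(Converse part) Fix an arbitrary $\delta > 0$. Let

$$\beta = \beta^\sharp + 2\delta. \tag{86}$$

We upper bound the Hellinger integral as follows: First note that

$$H_n^2(\beta) = \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(L_n - 1)} - 1 \right)^2 \mathbf{1}_{\{L_n \ge 1\}} \right] + \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(L_n - 1)} - 1 \right)^2 \mathbf{1}_{\{L_n < 1\}} \right]. \tag{87}$$

Applying Lemma 4.1, we have

$$\mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(L_n - 1)} - 1 \right)^2 \mathbf{1}_{\{L_n < 1\}} \right] \le \left( \sqrt{1 - n^{-\beta}} - 1 \right)^2 \le n^{-2\beta} = o(n^{-1}), \tag{88}$$

since $\beta > \beta^\sharp \ge \frac{1}{2}$ by (86). Consequently, the asymptotics of the Hellinger integral $H_n^2(\beta)$ is
dominated by the first term in (87), denoted by $a_n$, which we analyze below using the Laplace
method.

By (30), there exists $N_\delta \in \mathbb{N}$ such that

$$\ell_n(u\sqrt{2 \log n}) \le (\alpha(u) + \delta) \log n \tag{89}$$

holds for all $u \in \mathbb{R}$ and all $n \ge N_\delta$. Then

$$\begin{aligned}
a_n &= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(L_n - 1)} - 1 \right)^2 \mathbf{1}_{\{L_n \ge 1\}} \right] \\
&= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(U\sqrt{2 \log n})) - 1)} - 1 \right)^2 \mathbf{1}_{\{\ell_n(U\sqrt{2 \log n}) \ge 0\}} \right] \\
&\le \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(n^{\alpha(U) + \delta} - 1)} - 1 \right)^2 \mathbf{1}_{\{\alpha(U) \ge -\delta\}} \right] && (90) \\
&\le \mathbb{E}\left[ n^{(2(\alpha(U) + \delta - \beta)) \wedge (\alpha(U) + \delta - \beta)} \right] && (91) \\
&= \sqrt{\frac{\log n}{\pi}} \int n^{(2(\alpha(u) + \delta - \beta)) \wedge (\alpha(u) + \delta - \beta) - u^2}\, du, && (92)
\end{aligned}$$

where (90) and (91) are due to (89) and Lemma 4.2, respectively. Next we apply Lemma 3 to
analyze the exponent of (92). First we verify the integrability condition:

$$\int n^{(2(\alpha(u) + \delta - \beta)) \wedge (\alpha(u) + \delta - \beta) - u^2}\, du \le n^{\delta - \beta} \int n^{\alpha(u) - u^2}\, du < \infty$$

in view of (32). Applying (71) to (92), we have

$$a_n \le n^{\operatorname{ess\,sup}_{u \in \mathbb{R}} \{(2(\alpha(u) + \delta - \beta)) \wedge (\alpha(u) + \delta - \beta) - u^2\} + o(1)}. \tag{93}$$

By (86), $\alpha(u) - u^2 + \frac{u^2 \wedge 1}{2} \le \beta - \frac{1}{2} - 2\delta$ holds a.e. Consequently, $\alpha(u) - u^2 \le \beta - 1 - 2\delta$ holds for
almost every $u \in (-\infty, -1] \cup [1, \infty)$ and $\alpha(u) - \frac{u^2}{2} \le \beta - \frac{1}{2} - 2\delta$ holds for almost every $u \in [-1, 1]$.
These conditions immediately imply that

$$(2(\alpha(u) + \delta - \beta)) \wedge (\alpha(u) + \delta - \beta) - u^2 \le -1 - \delta \tag{94}$$

holds a.e. Assembling (87), (88), (93) and (94), we conclude that $H_n^2(\beta) = o(n^{-1})$. By the arbitrariness of
$\delta > 0$ and the alternative definition of $\overline{\beta}^*$ in (26), the proof of $\overline{\beta}^* \le \beta^\sharp$ is completed. □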

Proof of Theorem 2. In view of the proof of Theorem 1, the desired (33) readily follows from 
combining (84), (85), (88) and (93). □ 



Proof of Lemma 2. Put

$$c(t) = \int \exp(t(\alpha(u) - u^2))\, du. \tag{95}$$

(Necessity) Since $c(t) > 0$, it is sufficient to prove

$$\limsup_{t \to \infty} \frac{\log c(t)}{t} \le 0. \tag{96}$$

Since $\int g_n = 1$, we have $\int g_n(u\sqrt{2 \log n})\, du = (2 \log n)^{-1/2}$. By assumption, $g_n(u\sqrt{2 \log n}) = n^{\alpha(u) - u^2 + o(1)}$
uniformly in $u$. Then for all $\delta > 0$, $c(\log n) = \int n^{\alpha(u) - u^2}\, du \le \frac{n^\delta}{\sqrt{2 \log n}} < \infty$ holds for sufficiently
large $n$. In particular, $c(\log n) \le n^\delta$. For general $t > 0$, let $n_1 = \lfloor \exp(t) \rfloor$, $n_2 = \lceil \exp(t + 1) \rceil$
and $t_i = \log n_i$, $i = 1, 2$, so that $t_1 \le t < t_2$. Put $\frac{1}{p} = \frac{t_2 - t}{t_2 - t_1}$, $\frac{1}{q} = \frac{t - t_1}{t_2 - t_1}$, $a = \frac{t_1}{p}$ and $b = \frac{t_2}{q}$. Then $\frac{1}{p} \in [0, 1]$, $\frac{1}{p} + \frac{1}{q} = 1$ and $a + b = t$. Hölder's
inequality yields

$$c(t) = \int e^{a(\alpha(u) - u^2)} e^{b(\alpha(u) - u^2)}\, du \le c(t_1)^{\frac{1}{p}} c(t_2)^{\frac{1}{q}} = c(\log n_1)^{\frac{1}{p}} c(\log n_2)^{\frac{1}{q}} \le \exp(o(t)),$$

which gives the desired (96). It then follows from Lemma 3 that $\operatorname{ess\,sup}_u \{\alpha(u) - u^2\} \le 0$, i.e.,
$\alpha(u) \le u^2$ a.e.

(Sufficiency) Let $\alpha$ be a measurable function satisfying (32). Let $G_n$ be a probability measure with
the density

$$g_n(y) = \frac{1}{c(\log n)\sqrt{2 \log n}} \exp\left\{ \alpha\left( \frac{y}{\sqrt{2 \log n}} \right) \log n - \frac{y^2}{2} \right\},$$

which is a legitimate density function in view of (95). Then the log-likelihood ratio satisfies

$$\ell_n(u\sqrt{2 \log n}) = \alpha(u) \log n - \log c(\log n) - \frac{1}{2} \log \frac{\log n}{\pi},$$

which fulfills (31) uniformly.

For convolutional models, the convexity of $\alpha$ is inherited from the geometric properties of
the log-likelihood ratio in the normal location model: Since $y \mapsto \log \mathbb{E}\left[ \frac{\varphi(y - X)}{\varphi(y)} \right]$ is convex for any
random variable $X$ (see, e.g., [18, Property 3] and [14]), we have $\ell_n(((1 - t)u + tv)\sqrt{2 \log n}) \le (1 - t)\ell_n(u\sqrt{2 \log n}) + t\ell_n(v\sqrt{2 \log n})$
for any $t \in [0, 1]$ and $u, v \in \mathbb{R}$. Dividing both sides by $\log n$
and sending $n \to \infty$, we have $\alpha((1 - t)u + tv) \le (1 - t)\alpha(u) + t\alpha(v)$. □

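Proof of Corollary 1. Since $g_n = \varphi * p_n$, we have

$$\begin{aligned}
g_n(u\sqrt{2 \log n}) &= \sqrt{2 \log n} \int_{\mathbb{R}} \varphi((u - t)\sqrt{2 \log n})\, p_n(t\sqrt{2 \log n})\, dt \\
&= n^{o(1)} \int_{\mathbb{R}} n^{-(u - t)^2 - f(t) + o(1)}\, dt \\
&= n^{-\operatorname{ess\,inf}_{t \in \mathbb{R}} \{(u - t)^2 + f(t)\} + o(1)},
\end{aligned}$$

where the last equality follows from Lemma 3. Plugging the above asymptotics into $\ell_n = \log \frac{g_n}{\varphi}$,
we see that (31) is fulfilled uniformly in $u \in \mathbb{R}$ with $\alpha(u) = u^2 - \operatorname{ess\,inf}_{t \in \mathbb{R}} \{(u - t)^2 + f(t)\}$.
Applying Theorem 1, we obtain

$$\begin{aligned}
\beta^* &= \frac{1}{2} + \operatorname*{ess\,sup}_{u \in \mathbb{R}} \operatorname*{ess\,sup}_{t \in \mathbb{R}} \left\{ -(u - t)^2 - f(t) + \frac{u^2 \wedge 1}{2} \right\} \\
&= \frac{1}{2} + \operatorname*{ess\,sup}_{t \in \mathbb{R}} \left\{ -f(t) + \operatorname*{ess\,sup}_{u \in \mathbb{R}} \left\{ -(u - t)^2 + \frac{u^2 \wedge 1}{2} \right\} \right\} \\
&= \sup_{t \in \mathbb{R}} \left\{ \beta^*_{\rm IDJ}(t^2) - f(t) \right\},
\end{aligned}$$

where the last step follows from (53). □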

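Proof of Theorem 3. Let $W_n \sim Q_n$. Put $\nu_n = (1 - n^{-\beta}) Q_n + n^{-\beta} G_n$. Since $G_n \ll Q_n$ by assumption,
we also have $\nu_n \ll Q_n$. Denote the likelihood ratio (Radon-Nikodym derivative) by $L_n = \frac{dG_n}{dQ_n} = \exp(\ell_n)$. Then

$$\frac{d\nu_n}{dQ_n} = 1 + n^{-\beta}(\exp(\ell_n) - 1). \tag{97}$$

Instead of introducing the random variable $U$ in (79) for the Gaussian case, we apply the
quantile transformation to generate the distribution of $W_n$: Let $U$ be uniformly distributed on the
unit interval. Then $S = \log \frac{1}{U}$, which is exponentially distributed. Putting $S_n = \frac{S}{\log n}$, we have

$$W_n \overset{(d)}{=} z_n(U) = z_n(n^{-S_n}) \overset{(d)}{=} z_n(1 - n^{-S_n}). \tag{98}$$

Set $r_n(s) = \ell_n \circ z_n(n^{-s})$ and $t_n(s) = \ell_n \circ z_n(1 - n^{-s})$, which satisfy

$$\sup_{s \ge \log_n 2} |r_n(s) - \alpha_0(s) \log n| \le \delta \log n \tag{99}$$

$$\sup_{s \ge \log_n 2} |t_n(s) - \alpha_1(s) \log n| \le \delta \log n \tag{100}$$

for all sufficiently large $n$. For the converse proof, we can write the square Hellinger distance as an
expectation with respect to $S_n$:

$$\begin{aligned}
H_n^2(\beta) &= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(z_n(U))) - 1)} - 1 \right)^2 \mathbf{1}_{\{0 < U \le \frac{1}{2}\}} \right] \\
&\quad + \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(z_n(1 - U))) - 1)} - 1 \right)^2 \mathbf{1}_{\{0 < U < \frac{1}{2}\}} \right].
\end{aligned}$$

Analogous to (88), by truncating the log-likelihood ratio at zero, we can show that the Hellinger
distance is dominated by the following:

$$\begin{aligned}
a_n &= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(z_n(U))) - 1)} - 1 \right)^2 \mathbf{1}_{\{0 < U \le \frac{1}{2},\ r_n(S_n) \ge 0\}} \right] \\
&\quad + \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(\ell_n(z_n(1 - U))) - 1)} - 1 \right)^2 \mathbf{1}_{\{0 < U < \frac{1}{2},\ t_n(S_n) \ge 0\}} \right] \\
&= \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(r_n(S_n)) - 1)} - 1 \right)^2 \mathbf{1}_{\{S_n \ge \log_n 2,\ r_n(S_n) \ge 0\}} \right] \\
&\quad + \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(\exp(t_n(S_n)) - 1)} - 1 \right)^2 \mathbf{1}_{\{S_n > \log_n 2,\ t_n(S_n) \ge 0\}} \right] && (101) \\
&\le \mathbb{E}\left[ \left( \sqrt{1 + n^{-\beta}(n^{\alpha_0(S_n) + \delta} - 1)} - 1 \right)^2 + \left( \sqrt{1 + n^{-\beta}(n^{\alpha_1(S_n) + \delta} - 1)} - 1 \right)^2 \right] && (102) \\
&\le 2\, \mathbb{E}\left[ n^{2(\alpha_0 \vee \alpha_1(S_n) + \delta - \beta) \wedge (\alpha_0 \vee \alpha_1(S_n) + \delta - \beta)} \right] && (103) \\
&\le n^{-1 - \delta}, && (104)
\end{aligned}$$

where (101) follows from (97) - (98), (102) from (99) - (100) and (104) from (92) - (94). The direct
part of the proof is completely analogous to that of Theorem 1 by lower bounding the integral in
(101). □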
7.3 Proof of Theorem 4

Proof. Let $U_i = \Phi(X_i)$, which is uniformly distributed on $[0, 1]$ under the null hypothesis. With a
change of variable, we have

HC-W'f-*"" (105) 
= Vn sup — , (106) 

0<u<l \/U(l — Uj 

tic F 

which satisfies that ^ 2 log Fog n ~* ^ P" 604]. Therefore the Type-I error probability of 
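the test (50) vanishes for any choice of $\delta > 0$. It remains to show that $HC_n = \omega_P(\sqrt{\log \log n})$
under the alternative. To this end, fix $0 < s \le 1$ and put $r_{n,s} = 1 - \Phi(\sqrt{2s \log n})$ and
$p_{n,s} = (1 - n^{-\beta})(1 - \Phi(\sqrt{2s \log n})) + n^{-\beta}(1 - G_n(\sqrt{2s \log n}))$. By (105), we have

$$\begin{aligned}
HC_n \ge V_n(s) &\triangleq \sqrt{n}\, \frac{(1 - \mathbb{F}_n(\sqrt{2s \log n})) - r_{n,s}}{\sqrt{r_{n,s}(1 - r_{n,s})}} && (107) \\
&= \frac{N_n(s) - n r_{n,s}}{\sqrt{n r_{n,s}(1 - r_{n,s})}}, && (108)
\end{aligned}$$

where $\mathbb{F}_n$ denotes the empirical distribution function of the sample and $N_n(s) = \sum_{i=1}^n \mathbf{1}_{\{X_i \ge \sqrt{2s \log n}\}}$ is binomially distributed with sample size $n$ and success
probability $p_{n,s}$. Therefore

$$\mathbb{E}[V_n(s)] = \frac{\sqrt{n}(p_{n,s} - r_{n,s})}{\sqrt{r_{n,s}(1 - r_{n,s})}} \tag{109}$$

and

$$\operatorname{Var} V_n(s) = \frac{p_{n,s}(1 - p_{n,s})}{r_{n,s}(1 - r_{n,s})}. \tag{110}$$

By Chebyshev's inequality,

$$\mathbb{P}\left\{ V_n(s) \le \frac{1}{2} \mathbb{E}[V_n(s)] \right\} \le \frac{4 \operatorname{Var} V_n(s)}{\mathbb{E}[V_n(s)]^2} = \frac{4 p_{n,s}(1 - p_{n,s})}{n(p_{n,s} - r_{n,s})^2}.$$

By Lemma 6,

$$1 - G_n(\sqrt{2s \log n}) = n^{v(s) + o(1)}, \tag{111}$$

where $v(s) = \operatorname{ess\,sup}_{q \ge s} \{\alpha(q) - q\} \ge -s$. Plugging (111) into (109) and (110) yields

$$\mathbb{E}[V_n(s)] = n^{\frac{1 + s}{2} - \beta + v(s) + o(1)} \tag{112}$$

and

$$\mathbb{P}\left\{ V_n(s) \le \frac{1}{2} \mathbb{E}[V_n(s)] \right\} \le n^{2\beta - s - 1 - 2v(s) + o(1)} + n^{\beta - 1 - v(s) + o(1)}. \tag{113}$$

Suppose that $\beta < \frac{1 + s}{2} + v(s)$. Then $\mathbb{E}[V_n(s)] = \omega(\sqrt{\log \log n})$. Moreover, we have $2\beta - s - 1 - 2v(s) < 0$
and $\beta - 1 - v(s) < \frac{s - 1}{2} \le 0$ since $s \le 1$. Combining (107), (112) and (113), we obtain

$$\mathbb{P}\left\{ HC_n > \sqrt{(2 + \delta) \log \log n} \right\} = 1 - o(1),$$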
that is, the Type-II error probability also vanishes. Consequently, a sufficient condition for the
higher criticism test to succeed is

$$\begin{aligned}
\beta &< \sup_{0 < s \le 1} \left\{ \frac{1 + s}{2} + v(s) \right\} && (114) \\
&= \operatorname*{ess\,sup}_{0 < s \le 1} \left\{ \frac{1 + s}{2} + v(s) \right\}, && (115)
\end{aligned}$$

where (115) follows from the following reasoning: By [28, Proposition 3.5], the supremum and the
essential supremum (with respect to the Lebesgue measure) coincide for all lower semi-continuous
functions. Indeed, $v$ is lower semi-continuous by Lemma 5, and so is $s \mapsto \frac{1 + s}{2} + v(s)$.

It remains to show that the right-hand side of (115) coincides with the expression of $\beta^*$ in
Theorem 1. Indeed, we have

$$\begin{aligned}
\operatorname*{ess\,sup}_{0 \le s \le 1} \{s + 2v(s)\} &= \operatorname*{ess\,sup}_{0 \le s \le 1} \left\{ s + 2 \operatorname*{ess\,sup}_{q \ge s} \{\alpha(q) - q\} \right\} \\
&= \operatorname*{ess\,sup}_{q \ge 0}\ \operatorname*{ess\,sup}_{0 \le s \le q \wedge 1} \{2\alpha(q) - 2q + s\} \\
&= \operatorname*{ess\,sup}_{q \ge 0} \{2\alpha(q) - 2q + q \wedge 1\}.
\end{aligned}$$

Note that the second equality follows from interchanging the essential supremums: For any bi-measurable
function $(x, y) \mapsto f(x, y)$,

$$\operatorname*{ess\,sup}_{x} \operatorname*{ess\,sup}_{y} f(x, y) = \operatorname*{ess\,sup}_{y} \operatorname*{ess\,sup}_{x} f(x, y) = \operatorname*{ess\,sup}_{x, y} f(x, y),$$

where the last essential supremum is with respect to the product measure. Thus the proof of the
theorem is completed. □
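For readers who want to experiment with the test analyzed above, the following sketch computes a higher criticism statistic from a sample. It is an illustration under explicit assumptions: it uses the commonly used discretized, one-sided form $\max_i \sqrt{n}\,\frac{i/n - p_{(i)}}{\sqrt{p_{(i)}(1 - p_{(i)})}}$ over the sorted upper-tail $p$-values under a standard normal null, which may differ in detail (for example, one-sided versus two-sided) from the definition used in (105) - (106). The function name is hypothetical.

```python
import math


def higher_criticism(xs) -> float:
    """One-sided, discretized higher criticism statistic for a N(0,1) null (assumed form)."""
    n = len(xs)
    # Upper-tail p-values 1 - Phi(x), computed via the complementary error function.
    pvals = sorted(0.5 * math.erfc(x / math.sqrt(2.0)) for x in xs)
    best = -math.inf
    for i, p in enumerate(pvals, start=1):
        if 0.0 < p < 1.0:  # skip degenerate p-values to avoid division by zero
            best = max(best, math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p)))
    return best
```

As a deterministic example, a sample of 100 zeros has all $p$-values equal to $\frac{1}{2}$ and gives the value $10$, while replacing one observation by $6$ makes the smallest $p$-value tiny and inflates the statistic by several orders of magnitude. This is the mechanism by which rare but strong signals are picked up.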

A Hellinger distances for mixtures 

This appendix collects a few properties of total variation and Hellinger distances for mixture 
distributions. 

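Lemma 7. Let $0 \le \epsilon \le 1$ and $Q_1 \perp P$. Then

$$H^2(P, (1 - \epsilon) Q_0 + \epsilon Q_1) = 2(1 - \sqrt{1 - \epsilon}) + \sqrt{1 - \epsilon}\, H^2(P, Q_0), \tag{116}$$

which satisfies

$$\frac{1}{4} \le \frac{H^2(P, (1 - \epsilon) Q_0 + \epsilon Q_1)}{\epsilon \vee H^2(P, Q_0)} \le 4. \tag{117}$$

Proof. Since $Q_1 \perp P$, there exists a measurable set $E$ such that $P(E) = 0$ and $Q_1(E) = 1$. Then

$$\begin{aligned}
H^2(P, (1 - \epsilon) Q_0 + \epsilon Q_1) &= 2 - 2 \int \sqrt{dP\, ((1 - \epsilon)\, dQ_0 + \epsilon\, dQ_1)} \\
&= 2 - 2\sqrt{1 - \epsilon} \int_{E^c} \sqrt{dP\, dQ_0} \\
&= 2 - \sqrt{1 - \epsilon}\, (2 - H^2(P, Q_0)).
\end{aligned}$$

The inequalities in (117) follow from (116) and the facts that $\frac{\epsilon}{2} \le 1 - \sqrt{1 - \epsilon} \le \epsilon$ and $0 \le H^2 \le 2$. □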
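Identity (116) can be confirmed on a small discrete example. The sketch below is a hypothetical illustration: distributions are probability vectors on three points, $Q_1$ is a point mass on a point that $P$ does not charge (so $Q_1 \perp P$), and the squared Hellinger distance is taken in the unnormalized form $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$, which ranges over $[0, 2]$ as in the lemma.

```python
import math


def hellinger_sq(p, q) -> float:
    """Squared Hellinger distance sum_i (sqrt(p_i) - sqrt(q_i))^2 between probability vectors."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def mixture(q0, q1, eps: float):
    """The mixture (1 - eps) * q0 + eps * q1 of two probability vectors."""
    return [(1.0 - eps) * a + eps * b for a, b in zip(q0, q1)]


def rhs_116(p, q0, eps: float) -> float:
    """Right-hand side of (116)."""
    root = math.sqrt(1.0 - eps)
    return 2.0 * (1.0 - root) + root * hellinger_sq(p, q0)
```

For example, with `P = [0.5, 0.5, 0.0]`, `Q0 = [0.2, 0.8, 0.0]`, `Q1 = [0.0, 0.0, 1.0]` and `eps = 0.3`, the two sides of (116) agree to machine precision, and their common value divided by $\epsilon \vee H^2(P, Q_0)$ lies between $\frac{1}{4}$ and $4$.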
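Lemma 8. For any probability measures $(P, Q)$, $\epsilon \mapsto H^2(P, (1 - \epsilon) P + \epsilon Q)$ is increasing on $[0, 1]$.

Proof. Fix $0 \le \epsilon < \epsilon' \le 1$. Since $(1 - \epsilon) P + \epsilon Q = \frac{\epsilon}{\epsilon'}((1 - \epsilon') P + \epsilon' Q) + \frac{\epsilon' - \epsilon}{\epsilon'} P$, the convexity of
$H^2(P, \cdot)$ yields

$$H^2((1 - \epsilon) P + \epsilon Q, P) \le \frac{\epsilon}{\epsilon'} H^2((1 - \epsilon') P + \epsilon' Q, P) \le H^2((1 - \epsilon') P + \epsilon' Q, P). \quad □$$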

We conclude this appendix by proving Lemma 1 presented in Section 2.1: 

Proof. By Lemma 8, the function $\beta \mapsto H_n^2(\beta)$ is decreasing, which, in view of the characterization
(25) - (26), implies that $\underline{\beta}^* \le \overline{\beta}^*$. Thus it only remains to establish the rightmost inequality in
(19). To this end, we show that as soon as $\beta$ exceeds 1, $V_n(\beta)$ becomes $o(1)$ regardless of the choice
of $\{G_n\}$: Fix $\beta > 1$. Then

$$\begin{aligned}
V_n(\beta) &= \mathrm{TV}(\Phi^n, ((1 - n^{-\beta}) \Phi + n^{-\beta} G_n)^n) \\
&\le \mathrm{TV}(\delta_0^n, ((1 - n^{-\beta}) \delta_0 + n^{-\beta} \delta_1)^n) && (118) \\
&= 1 - (1 - n^{-\beta})^n \\
&= o(1),
\end{aligned}$$

where (118) follows from the data-processing inequality, which is satisfied for all $f$-divergences
[9], in particular, the total variation: $\mathrm{TV}(P_Y, Q_Y) \le \mathrm{TV}(P_X, Q_X)$, where $Q_{Y|X} = P_{Y|X}$ is a
probability transition kernel. □
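The bound $1 - (1 - n^{-\beta})^n$ appearing in the last display is simple to evaluate, and doing so makes the threshold at $\beta = 1$ visible. The sketch below is only an illustration; the function name is hypothetical, and `log1p`/`expm1` are used so that the value remains accurate when $n^{-\beta}$ is tiny. It assumes $n \ge 2$ and $\beta > 0$.

```python
import math


def tv_point_mass_bound(n: int, beta: float) -> float:
    """Evaluate 1 - (1 - n^{-beta})^n stably (assumes n >= 2 and beta > 0)."""
    eps = n ** (-beta)
    return -math.expm1(n * math.log1p(-eps))
```

For $n = 10^6$, the value is about $10^{-3}$ when $\beta = 1.5$ (roughly $n^{1 - \beta}$), about $1 - e^{-1} \approx 0.632$ when $\beta = 1$, and indistinguishable from $1$ when $\beta = 0.5$.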

Remark 4. While Lemma 8 is sufficient for our purpose in proving Lemma 1, it is unclear whether
the monotonicity carries over to $\epsilon \mapsto \mathrm{TV}(P^n, ((1 - \epsilon) P + \epsilon Q)^n)$, since product measures do not
form a convex set. It is however easy to see that $\epsilon \mapsto \mathrm{TV}(P, (1 - \epsilon) P + \epsilon Q)$ is increasing,
which follows from the proof of Lemma 8 with $H^2$ replaced by $\mathrm{TV}$. It is also clear that $\epsilon \mapsto H^2(P^n, ((1 - \epsilon) P + \epsilon Q)^n)$
is increasing in view of (23).

B The implication of the condition (39) 

In this appendix we show that (39) implies that $\beta^* = 1$, i.e., for any $\beta < 1$, the hypotheses in
(11) can be tested reliably. Without loss of generality, we assume that $u > 1$. Then

$$r_n \triangleq G_n((u\sqrt{2 \log n}, \infty)) = n^{-o(1)}.$$

We show that the total variation distance between the product measures converges to one. Put
$A_n = (-\infty, \sqrt{2s \log n}]^n$ with $s = u^2 > 1$. In view of the first inequality in (14), the total variation distance can be
lower bounded as follows:

$$V_n(\beta) \ge \Phi^n(A_n) - ((1 - n^{-\beta}) \Phi + n^{-\beta} G_n)^n(A_n).$$
Using (44), we have

$$\Phi^n(A_n) = \left(1 - (1 - \Phi(\sqrt{2s \log n}))\right)^n = 1 - \frac{n^{1 - s}}{2\sqrt{\pi s \log n}}(1 + o(1)).$$

On the other hand

$$\begin{aligned}
((1 - n^{-\beta}) \Phi + n^{-\beta} G_n)^n(A_n) &= \left(1 - (1 - n^{-\beta})(1 - \Phi(\sqrt{2s \log n})) - n^{-\beta} r_n\right)^n \\
&= \left(1 - n^{-s + o(1)} + n^{-\beta - s + o(1)} - n^{-\beta + o(1)}\right)^n \\
&= o(1),
\end{aligned}$$

where the last equality is due to $0 < \beta < 1 < s$. Therefore $V_n(\beta) = 1 - o(1)$ for any $\beta < 1$, which
proves that $\beta^* = 1$.

In fact, the above derivation also shows that the following maximum test achieves vanishing
probability of error: declare $H_1$ if and only if $\max_i |X_i| > |u|\sqrt{2 \log n}$. In general the maximum test
is suboptimal. For example, in the classical setting (2) where $G_n = \delta_{\mu_n}$, [12, Theorem 1.3] shows
that the maximum test does not attain the Ingster-Donoho-Jin detection boundary for $\beta \in [\frac{1}{2}, \frac{3}{4}]$.